REPORT DOCUMENTATION PAGE

Form Approved
OMB No. 074-0188

Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources,
gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of the
collection of information, including suggestions for reducing this burden to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson
Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188), Washington, DC 20503.

1. AGENCY USE ONLY (Leave blank)

2. REPORT DATE
3 May 2007

3. REPORT TYPE AND DATE COVERED

4. TITLE AND SUBTITLE
Autonomous Detection and Imaging of Abandoned Luggage in
Real World Environments

5. FUNDING NUMBERS

6. AUTHOR(S)
Papon, Jeremie A.

7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)

8. PERFORMING ORGANIZATION REPORT NUMBER

9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)
US Naval Academy
Annapolis, MD 21402

10. SPONSORING/MONITORING AGENCY REPORT NUMBER
Trident Scholar project report no.
357 (2007)

11. SUPPLEMENTARY NOTES

12a. DISTRIBUTION/AVAILABILITY STATEMENT
This document has been approved for public release; its distribution
is UNLIMITED.

12b. DISTRIBUTION CODE

13. ABSTRACT
This Trident Project developed a system that is able to detect and produce high
resolution  imagery  of  unattended  items  in  a  crowded  scene,  such  as  an  airport,  using  live  video 
processing  techniques.  Video  surveillance  is  commonplace  in  today's  public  areas,  but  as  the  number  of 
cameras  increases,  so  do  the  human  resources  required  to  monitor  them.  Additionally,  current 
surveillance  networks  are  restricted  by  the  low  resolution  of  their  cameras.  For  example,  while  there  is 
an  extensive  security  camera  network  in  the  London  Underground,  its  low  resolution  prevented  it  from 
being  used  to  automatically  identify  the  terrorists  that  entered  the  train  stations  in  July  2005.  With 
this  in  mind,  this  project  developed  a  surveillance  system  that  is  able  to  autonomously  monitor  a  scene 
for  suspicious  events  by  combining  a  low  resolution  camera  for  surveillance  (a  webcam)  with  a  moving 
high  resolution  camera  (a  6  mega-pixel  digital  still-frame  camera)  to  provide  a  greater  level  of  detail. 
This  enhanced  capability  is  used  to  determine  whether  or  not  the  event  is  a  threat.  For  the  purposes  of 
this  research,  suspicious  events  were  defined  as  a  person  leaving  a  piece  of  luggage  unattended  for  an 
extended  period  of  time.  Initial  analysis  of  the  surveillance  video  involved  separating  the  foreground 
(such  as  people  carrying  luggage)  from  the  background.  In  order  to  do  this  using  live  video,  an 
automated  algorithm  was  developed  which  creates  a  composite  background  image  from  a  small  number  of 
video  frames. 


14. SUBJECT TERMS
Biometrics, Motion Detection, Security Camera,
Surveillance, Video

15. NUMBER OF PAGES
82

16. PRICE CODE

17. SECURITY CLASSIFICATION OF REPORT

18. SECURITY CLASSIFICATION OF THIS PAGE

19. SECURITY CLASSIFICATION OF ABSTRACT

20. LIMITATION OF ABSTRACT

NSN 7540-01-280-5500
Standard Form 298 (Rev. 2-89)
Prescribed by ANSI Std. Z39-18
298-102



13. ABSTRACT (CONT.) In the algorithm, areas detected as motion were removed from individual frames.
These  processed  frames,  which  represented  regions  of  no  motion,  were  then  combined  by  taking  the  median 
value  for  each  pixel  across  all  frames.  These  median  values  were  used  to  form  a  composite  background  image 
which contained only the non-moving parts of the scene (i.e., the background). Once this background was
obtained,  the  system  then  detected  live  motion.  Using  a  variety  of  filtering  techniques,  individual 
foreground  objects  were  separated  from  the  stationary  background.  These  objects  were  then  tracked  over  time. 
When  an  object  (for  the  purposes  of  this  research,  a  moving  object  was  assumed  to  be  a  person)  divided  into 
two  different  objects,  they  were  then  tracked  to  see  if  one  of  the  objects  remained  motionless.  In  doing  so, 
the  system  was  able  to  detect  an  "abandonment"  event.  When  such  an  event  occurred,  an  event  timer  began  to 
determine  how  long  the  luggage  had  been  abandoned.  If  the  luggage  was  left  unattended  for  a  preset  amount  of 
time,  the  system  tagged  it  for  high  resolution  imaging.  Once  an  abandoned  item  was  tagged  for  high 
resolution  imaging,  the  system  used  a  motorized  pan/tilt  mount  to  point  the  high  resolution  camera  and 
acquire  a  high  resolution  image  of  the  item.  This  image  was  then  sent  to  a  human  supervisor  for  further 
investigation. The final security system can allow a single person to monitor a vast array of camera systems
(spanning, for example, an entire airport) for abandoned luggage or any other pre-defined suspicious event.



Project  Abstract 

This  Trident  Project  developed  a  system  that  is  able  to  detect  and  produce  high 
resolution  imagery  of  unattended  items  in  a  crowded  scene,  such  as  an  airport,  using  live  video 
processing  techniques.  Video  surveillance  is  commonplace  in  today’s  public  areas,  but  as  the 
number  of  cameras  increases,  so  do  the  human  resources  required  to  monitor  them.  Additionally, 
current  surveillance  networks  are  restricted  by  the  low  resolution  of  their  cameras.  For  example, 
while  there  is  an  extensive  security  camera  network  in  the  London  Underground,  its  low 
resolution  prevented  it  from  being  used  to  automatically  identify  the  terrorists  that  entered  the 
train  stations  in  July  2005.  With  this  in  mind,  this  project  developed  a  surveillance  system  that  is 
able  to  autonomously  monitor  a  scene  for  suspicious  events  by  combining  a  low  resolution 
camera  for  surveillance  (a  webcam)  with  a  moving  high  resolution  camera  (a  6  mega-pixel 
digital  still-frame  camera)  to  provide  a  greater  level  of  detail.  This  enhanced  capability  is  used  to 
determine  whether  or  not  the  event  is  a  threat. 

For  the  purposes  of  this  research,  suspicious  events  were  defined  as  a  person  leaving  a 
piece  of  luggage  unattended  for  an  extended  period  of  time.  Initial  analysis  of  the  surveillance 
video  involved  separating  the  foreground  (such  as  people  carrying  luggage)  from  the 
background.  In  order  to  do  this  using  live  video,  an  automated  algorithm  was  developed  which 
creates  a  composite  background  image  from  a  small  number  of  video  frames.  In  the  algorithm, 
areas  detected  as  motion  were  removed  from  individual  frames.  These  processed  frames,  which 
represented  regions  of  no  motion,  were  then  combined  by  taking  the  median  value  for  each  pixel 
across  all  frames.  These  median  values  were  used  to  form  a  composite  background  image  which 
contained  only  the  non-moving  parts  of  the  scene  (i.e.,  the  background). 



Once  this  background  was  obtained,  the  system  then  detected  live  motion.  Using  a 
variety  of  filtering  techniques,  individual  foreground  objects  were  separated  from  the  stationary 
background.  These  objects  were  then  tracked  over  time.  When  an  object  (for  the  purposes  of  this 
research,  a  moving  object  was  assumed  to  be  a  person)  divided  into  two  different  objects,  they 
were  then  tracked  to  see  if  one  of  the  objects  remained  motionless.  In  doing  so,  the  system  was 
able  to  detect  an  “abandonment”  event.  When  such  an  event  occurred,  an  event  timer  began  to 
determine  how  long  the  luggage  had  been  abandoned.  If  the  luggage  was  left  unattended  for  a 
preset  amount  of  time,  the  system  tagged  it  for  high  resolution  imaging. 

Once  an  abandoned  item  was  tagged  for  high  resolution  imaging,  the  system  used  a 
motorized  pan/tilt  mount  to  point  the  high  resolution  camera  and  acquire  a  high  resolution  image 
of  the  item.  This  image  was  then  sent  to  a  human  supervisor  for  further  investigation.  The  final 
security system can allow a single person to monitor a vast array of camera systems (spanning,
for example, an entire airport) for abandoned luggage or any other pre-defined suspicious event.

Keywords:  Biometrics,  Motion  Detection,  Security  Camera,  Surveillance,  Video 




Acknowledgments 

The  author  wishes  to  express  his  appreciation  to: 

Dr. Robert Ives, Electrical Engineering Department, USNA - Primary Project Adviser
Dr. Randy Broussard, Weapons and Systems Engineering Department, USNA - Secondary Project Adviser
Mr. Jeffery Dunn, National Security Agency - External Collaborator
Ms. Daphi Jobe, Electrical Engineering Department, USNA - Electronics Technician
Mr. Jerry Ballman, Electrical Engineering Department, USNA - Laboratory Technician
Mr. Michael Wilson, Electrical Engineering Department, USNA - Laboratory Technician


Table  of  Contents 




1. The Surveillance Problem
2. System Components
3. Interfacing of Components
4. Composite Background Creation
5. Foreground Segmentation
6. Object Tracking
7. Testing
8. Conclusions
9. References
Appendix A: Glossary
Appendix B: Graphical User Interface
Appendix C: MATLAB Code
Appendix D: Papers Published
Appendix E: Original Project Report




Table  of  Figures 

Figure 1: Terrorists in London Subway on July 7th 2005
Figure 2: Logitech Quickcam Pro 5000
Figure 3: Canon PowerShot S3IS
Figure 4: Directed Perception Pan/Tilt Mount
Figure 5: The Assembled System
Figure 6: Graphical User Interface
Figure 7: Composite Background Creation Algorithm
Figure 8: Example of Background Creation Using Live Video Frames
Figure 9: Foreground Segmentation Algorithm
Figure 10: Example of Foreground Segmentation in London Train Station
Figure 11: Foreground Object Tracking
Figure 12: Example of Object Tracking
Figure 13: Time to Create Composite Background Images
Figure 14: Accuracy of Background Creation Algorithms
Figure 15: Integrated System Testing Results
Figure 16: High Resolution, Poor Lighting Conditions
Figure 17: High Resolution, Good Lighting Conditions


1.  The  Surveillance  Problem 




Video surveillance is extremely common in modern public areas, and as the number of
cameras increases, so does the number of video feeds to monitor. This places an increasing burden
on security personnel, as more and more of them must be assigned to the task of monitoring the videos.
Additionally, a person can watch only so many video feeds at the same time with any sort
of efficiency and, as with any system involving human monitors, is subject to human error.
Take, for example, the London subway bombings of July 7th, 2005, in which the terrorists passed
directly  in  front  of  several  security  cameras  (Figure  1).  Even  though  the  terrorists  were  on 
several  watch  lists,  and  security  personnel  were  in  a  heightened  state  of  alert,  the  terrorists  were 
allowed  free  access  to  the  subways,  and  managed  to  leave  their  bags  (which  contained  bombs) 
throughout  the  London  Underground. 


Figure 1: Terrorists in London Subway on July 7th 2005 (United Kingdom Crown Copyright, available at
http://news.bbc.co.uk/2/hi/uk_news/politics/4689739.stm)



In addition to the fallibility of human monitors, current surveillance networks suffer
from another major drawback: their low resolution. This severely limits the ability to identify
threats  via  automated  algorithms,  such  as  the  one  presented  in  this  report.  The  low  resolution  of 
current  security  systems  may  also  prevent  the  implementation  of  covert  automated  facial 
recognition  of  people  which  might  have  been  used  to  alert  security  personnel  to  the  presence  of 
terrorists  in  the  London  Underground.  As  such,  this  research  has  developed  a  dual  camera 
system,  which  combines  a  low  resolution  video  camera  with  a  high  resolution  still-frame  camera 
mounted  on  a  moving  pan/tilt  mount.  Combined  with  supporting  application  software,  this 
allows  for  automated  analysis  of  video  to  detect  objects  of  interest  and  investigate  them  using  the 
high  resolution  camera. 

2.  System  Components 

The  low  resolution  camera  is  a  Logitech  Quickcam  Pro  5000  (Figure  2).  This  camera  has 
proven  ideal  for  the  purpose  of  surveillance  because  it  combines  good  resolution  (640  x  480 
pixels),  fast  frame  rates  (up  to  30  frames  per  second),  and  a  wide  field  of  view  (82  degrees)  in  a 
small  robust  package.  In  this  system,  the  Quickcam  is  stationary,  and  surveys  an  entire  area  of 
interest  at  once,  continuously  monitoring  for  movement.  When  movement  in  the  field  of  view  is 
detected,  the  camera  video  output  is  processed  for  object  localization.  The  Quickcam  interfaces 
directly  with  MATLAB  (a  high-performance  programming  language  for  technical  computing) 
running  on  a  Personal  Computer  (PC)  via  a  Universal  Serial  Bus  (USB)  cable,  which  allows  for 
quick  processing  of  the  video  feed. 




Figure  2:  Logitech  Quickcam  Pro  5000  (from  Logitech  at  http://www.logitech.com/) 

The high resolution camera is a 6 mega-pixel (a mega-pixel is one million
pixels) Canon PowerShot S3IS (Figure 3). This camera was chosen because of its good low-light
performance and motorized zoom. The twelve-times optical zoom is extremely fast, and can be
controlled remotely, allowing for imaging of small objects at relatively long distances. The
camera is mounted on a high speed pan/tilt mount for pointing, and comes with a Software
Development Kit (SDK) which allows for remote control of the camera over a USB cable.


Figure  3:  Canon  PowerShot  S3IS  (from  Canon  USA  at  http://www.usa.canon.com/) 

A pan/tilt mount is a device that interprets digital input signals to control its orientation
about two axes (pan and tilt). This project uses a Directed Perception PTU-D46-17 pan/tilt mount
which can pan (left/right) a full 360 degrees and tilt (up/down) 180 degrees, at speeds of over
300 degrees per second. The mount (Figure 4) allows for fast, stable, and accurate pointing of the
high resolution camera. The pan/tilt mount is controlled through the serial port of the PC, and
can be commanded directly from a Graphical User Interface (GUI) using the serial port
functionality built into MATLAB. The GUI portion of the system and its control of the
system’s hardware components are described in the next section.


Figure  4:  Directed  Perception  Pan/Tilt  Mount  (from  Directed  Perception  at  http://www.dperception.com/) 

3.  Interfacing  of  Components 

The first major hurdle in developing a system that combines various off-the-shelf
components into a single working unit is interfacing. Both the high resolution camera and the
pan/tilt mount required separate controls to be developed in order to meet their desired
functionality. This was accomplished in MATLAB.

Control of the pan/tilt mount was accomplished using the computer’s serial port. The
serial port is used to send commands from MATLAB to the pan/tilt mount’s controller, and
allows control of both the movement of the mount and all of its settings, such as slew rate
(maximum turning rate) and acceleration. The commands to move the pan/tilt mount, which serve
to point the high resolution camera, are based on the location of the objects in the view of the low
resolution webcam.
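
The conversion from an object's location in the webcam image to a pointing command is not spelled out in this section. The following is a minimal sketch in Python (not the project's MATLAB code) of one way such a conversion could work. It assumes an ideal pinhole webcam that shares its optical center with the pan/tilt mount, square pixels, and that the 82 degree field of view quoted for the webcam is horizontal. The function name `pixel_to_pan_tilt` and all of these assumptions are illustrative only.

```python
import math

def pixel_to_pan_tilt(x, y, width=640, height=480, hfov_deg=82.0):
    """Convert a pixel location to pan/tilt angles in degrees.

    Assumptions (not from the report): ideal pinhole camera, hfov_deg is
    the horizontal field of view, square pixels, and the webcam and the
    mount share one optical center. Positive pan is right, positive tilt
    is up.
    """
    # Focal length in pixels, derived from the horizontal field of view.
    focal_px = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
    dx = x - width / 2.0    # pixels to the right of the image center
    dy = height / 2.0 - y   # pixels above the image center (rows grow downward)
    pan = math.degrees(math.atan2(dx, focal_px))
    tilt = math.degrees(math.atan2(dy, math.hypot(dx, focal_px)))
    return pan, tilt
```

Under these assumptions the image center maps to zero pan and zero tilt, and the right-hand edge of the image maps to a pan of half the field of view. A real installation would also need a calibration step to account for the offset between the two cameras.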

Controlling the Canon PowerShot high resolution camera proved to be much more
difficult. The primary reason for this is that the Canon SDK, which provides a basic framework
for communicating with Canon cameras, was written in the C programming language. In order to
use the camera, all of the control functions had to be ported into MATLAB using MEX files.
MEX files are a specific type of MATLAB file which are written in C, but can be compiled and run
in MATLAB. MEX files for initializing the camera, changing camera settings, and controlling
the shutter were created. But the actual transfer of images from the high resolution camera to the
computer over the USB cable proved problematic, as the Canon framework for doing this could
not be ported into MATLAB. As such, new functions were written in C which transferred the
image data from the camera over the USB cable in individual packets instead of one large file
transfer. Packet sizes in initial testing were limited to only 1024 bytes, which resulted in very
slow image transfers (upwards of 5 seconds). After several adjustments and revisions
to the transfer protocol, 128 kilobyte packets were used successfully, which increased transfer
speeds substantially (approximately one second for a one megabyte image). The complete
assembled system can be seen in Figure 5.
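
The effect of packet size on the number of transfers can be illustrated with simple arithmetic. The sketch below is written in Python for illustration only; it is not the project's C transfer code. It takes the one megabyte image size from the example above and assumes that a megabyte means 1,048,576 bytes and a kilobyte 1024 bytes.

```python
def packets_needed(image_bytes, packet_bytes):
    """Number of packets required to move image_bytes in packet_bytes chunks."""
    # Ceiling division: a final partial packet still costs one transfer.
    return -(-image_bytes // packet_bytes)

ONE_MEGABYTE = 1024 * 1024  # assumption: binary megabyte

small = packets_needed(ONE_MEGABYTE, 1024)        # initial 1024-byte packets
large = packets_needed(ONE_MEGABYTE, 128 * 1024)  # final 128-kilobyte packets
print(small, large)  # prints "1024 8"
```

Moving the same image in 8 packets rather than 1024 means far fewer individual transfers, which is consistent with the improvement in transfer time described above.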


Figure  5:  The  Assembled  System 


The entire system is controlled using a MATLAB GUI that was compiled into a
standard  Windows  executable  file  for  faster  execution  speed.  The  GUI,  shown  in  Figure  6, 
allows  a  human  operator  to  control  all  of  the  system  components  using  a  single  interface.  It 
allows  the  operator  to  adjust  settings  for  each  component,  view  the  low  resolution  video  feed, 
view  a  graphical  representation  of  the  object  tracking  algorithm,  and  view  any  high  resolution 
images  acquired.  The  GUI  has  built-in  options  to  allow  for  the  control  of  several  camera  systems 
at  once;  this  allows  for  the  control  of  an  entire  surveillance  network  from  one  central  computer. 
A detailed description of the GUI’s functionality is contained in Appendix B, and the algorithms
that  run  the  controls  are  described  in  the  following  sections. 


Figure  6:  Graphical  User  Interface 




4.  Composite  Background  Creation 

One  approach  to  detecting  abandoned  luggage  in  a  video  frame  is  to  compare  the  frame 
to  an  image  of  the  same  scene  that  has  no  abandoned  luggage  in  it  (the  background).  The  first 
element  necessary  for  accurate  analysis  of  a  live  surveillance  feed  was  therefore  the  creation  of 
an  accurate  background  image.  This  is  easy  in  a  laboratory,  where  one  can  simply  clear  the  area 
of  people  and  other  foreground  objects.  Unfortunately,  in  real  world  applications  it  can  be 
difficult,  if  not  impossible,  to  clear  an  area  in  order  to  get  a  background  image.  Additionally, 
even  if  one  clears  an  area  and  obtains  an  accurate  background  image  when  a  camera  is  installed, 
any  changes  in  the  background  (such  as  a  light  burning  out,  or  furniture  being  moved)  can 
severely  degrade  the  system’s  ability  to  separate  foreground  from  background.  There  are  several 
methods that have been developed to create composite backgrounds, the most common of which are
Kalman filtering [1] and mixture of Gaussians [2]. For the purposes of this research, neither of
these techniques was computationally efficient enough for real-time applications, so a new
technique  was  developed  that  is  far  more  efficient  while  still  attaining  equal,  if  not  better,  levels 
of  accuracy.  With  this  in  mind,  this  research  has  developed  a  robust  algorithm  which  is  able  to 
take  live  video  and  dynamically  create  a  “composite”  background  image  (Figure  7).  Using  this 
algorithm,  a  background  can  be  created  even  with  people  present  in  the  field  of  view  of  the 
camera.  Additionally,  because  the  algorithm  works  quickly  and  can  be  applied  to  live  video,  it 
can  be  used  to  periodically  update  the  background  image  automatically  to  account  for  any 
changes  to  the  background. 



Figure  7:  Composite  Background  Creation  Algorithm 


The  first  step  in  the  background  creation  is  simple  frame-to-frame  differencing  and 
thresholding  to  detect  regions  that  contain  movement.  Thresholding  creates  a  binary  image  (or 
mask)  with  values  of  zero  or  one  in  every  pixel  location,  where  zero  corresponds  to  motion  and 
one  to  motionless  (background).  Threshold  values  for  creation  of  the  binary  image  are  calculated 
using  Otsu’s  method  [3].  This  difference  mask  is  then  processed  with  binary  morphology 
(dilation)  to  account  for  noise  in  the  thresholding  and  reduce  the  possibility  of  missing  actual 
motion.  Noise  occurs  naturally,  since  lighting  conditions  vary  continually  over  time,  even  when 
humans perceive constant illumination. Under electrical lighting, this is due to variations in the
electric power source. The mask is then multiplied by the current video frame to create a
motionless image frame, which contains only background areas, and is then stored. This process
is repeated until every pixel location has at least five “motionless” values. Once the array of
motionless images is created, the median value for each pixel is calculated and used to form
the final composite background image. In testing, this composite background creation process
was shown to take generally about 4 seconds (at 15 frames per second) for normal surveillance
video  clips.  An  example  of  the  background  creation  is  shown  in  Figure  8.  The  surveillance  clip 
used  comes  from  a  London  train  station,  filmed  for  the  “Performance  Evaluation  of  Tracking  and 
Surveillance  Conference  2006”  (PETS  2006)  [4].  The  composite  background  image  is  then  used 
in  detecting  and  segmenting  the  foreground  for  follow-on  video. 
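
The steps described in this section can be sketched as follows. This is a minimal illustration in Python rather than the project's MATLAB implementation. Frames are assumed to be grayscale images stored as nested lists, a fixed `threshold` stands in for the value that Otsu's method would supply, and motionless samples are collected per pixel instead of storing masked frames. The function names are hypothetical.

```python
from statistics import median

def motionless_mask(prev, curr, threshold):
    """Return a mask with 0 where the two frames differ (motion), 1 elsewhere."""
    return [[0 if abs(c - p) > threshold else 1
             for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev, curr)]

def grow_motion(mask):
    """Dilate the motion regions (zeros) by one pixel in a 3x3 neighborhood."""
    h, w = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    for y in range(h):
        for x in range(w):
            if mask[y][x] == 0:
                for ny in range(max(0, y - 1), min(h, y + 2)):
                    for nx in range(max(0, x - 1), min(w, x + 2)):
                        out[ny][nx] = 0
    return out

def composite_background(frames, threshold, min_samples=5):
    """Median of the motionless samples collected at each pixel.

    Returns None if the frames run out before every pixel location has
    min_samples motionless values.
    """
    h, w = len(frames[0]), len(frames[0][0])
    samples = [[[] for _ in range(w)] for _ in range(h)]
    for prev, curr in zip(frames, frames[1:]):
        mask = grow_motion(motionless_mask(prev, curr, threshold))
        for y in range(h):
            for x in range(w):
                if mask[y][x]:
                    samples[y][x].append(curr[y][x])
        if all(len(s) >= min_samples for row in samples for s in row):
            return [[median(s) for s in row] for row in samples]
    return None
```

Because pixels flagged as motion never contribute samples, a person walking through the scene does not appear in the result as long as each location is eventually seen motionless at least `min_samples` times.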


Figure  8:  Example  of  Background  Creation  Using  Live  Video  Frames 


5.  Foreground  Segmentation 

Once  a  composite  background  image  is  created,  foreground  segmentation  is  possible. 
This  is  accomplished  using  the  algorithm  shown  in  the  block  diagram  of  Figure  9.  The  first  step 
in  the  process  is  differencing  and  thresholding  both  the  current  video  frame  and  the  previous 
frame  with  the  background  image  to  create  binary  masks,  where  binary  1  represents  regions  not 
in  the  background.  These  masks  are  then  filtered  using  a  majority  filter  and  then  morphologically 
closed.  This  majority  filter  operates  in  a  3-by-3  neighborhood  around  each  pixel  in  each  binary 
mask,  and  sets  that  pixel  to  a  value  of  one  if  there  are  five  or  more  ones  in  its  neighborhood 
(otherwise it sets the pixel to zero) [5]. These two different masks (derived from the current and
previous video frames) are then combined using a logical AND function to create a foreground
mask.
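
A minimal sketch of this segmentation step is given below, in Python rather than the project's MATLAB. It assumes grayscale frames stored as nested lists and a fixed `threshold` chosen by the caller, treats pixels outside the image as zero, and omits the morphological closing step for brevity. The function names are hypothetical.

```python
def difference_mask(frame, background, threshold):
    """1 where the frame differs from the background by more than threshold."""
    return [[1 if abs(f - b) > threshold else 0
             for f, b in zip(frame_row, bg_row)]
            for frame_row, bg_row in zip(frame, background)]

def majority_filter(mask):
    """Set a pixel to 1 if five or more pixels in its 3x3 neighborhood are 1.

    The neighborhood includes the pixel itself; pixels outside the image
    are counted as 0 (an assumption of this sketch).
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ones = sum(mask[ny][nx]
                       for ny in range(max(0, y - 1), min(h, y + 2))
                       for nx in range(max(0, x - 1), min(w, x + 2)))
            out[y][x] = 1 if ones >= 5 else 0
    return out

def foreground_mask(curr, prev, background, threshold):
    """AND together the filtered masks of the current and previous frames."""
    m_curr = majority_filter(difference_mask(curr, background, threshold))
    m_prev = majority_filter(difference_mask(prev, background, threshold))
    return [[a & b for a, b in zip(row_c, row_p)]
            for row_c, row_p in zip(m_curr, m_prev)]
```

The AND step means that a pixel must differ from the background in both frames to be kept, so single-frame flicker is discarded while objects that persist across frames remain.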


Figure  9:  Foreground  Segmentation  Algorithm 


This  process  of  combining  the  current  frame  with  the  previous  frame  was  shown  to  be 
extremely  effective  in  testing,  and  resulted  in  more  accurate  segmentation  results  than  could  be 
attained  by  using  only  the  current  frame.  An  example  of  the  foreground  segmentation  process 
applied  to  the  PETS  2006  data  set  is  shown  in  Figure  10.  Once  the  foreground  mask  is  obtained, 
the  algorithm  moves  on  to  the  next  stage,  object  tracking. 




Figure  10:  Example  of  Foreground  Segmentation  in  London  Train  Station 

Upper  Left:  Input  video  frame. 

Upper  Right:  Cutout  showing  foreground  segmented. 

Lower  Left:  Foreground  mask  applied  to  background  frame. 

Lower  Right:  Binary  foreground  mask. 


6.  Object  Tracking 

The final stage of the algorithm (shown in Figure 11) is the actual tracking of foreground
objects. Foreground objects consist of regions of contiguous binary 1’s in the foreground mask.
The size and center of each of these regions are calculated, and compared with the objects present
in the previous iteration. If an object is “new”, it is stored in an object structure. An object
structure is a block of data which contains all the pertinent information about a particular object,
such as size, location and speed. In testing, new objects were detected when people entered
the scene, a group of people separated, or a person and their luggage separated. If an object
was  present  in  the  previous  frame  and  is  present  in  the  current  frame,  its  statistics  are  updated 
and  its  motion  vector  (direction  and  speed)  is  calculated.  This  motion  vector  is  used  to  determine 
if  an  object  is  idle;  if  the  object  remains  idle  for  longer  than  a  threshold  time  t,  it  is  queued  for 
imaging  using  the  high  resolution  camera.  For  this  research,  the  threshold  time  t  was  defined  as 
15  seconds. 
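
The idle test can be sketched as follows. This is an illustration in Python, not the project's MATLAB object structure. The field names, the `still_tolerance` parameter, and the `update` function are hypothetical, and the matching of regions between frames is assumed to have been done already.

```python
from dataclasses import dataclass
import math

IDLE_THRESHOLD_S = 15.0  # threshold time t from the text

@dataclass
class TrackedObject:
    center: tuple               # (x, y) centroid in pixels
    size: int                   # area in pixels
    speed: float = 0.0          # pixels per second
    idle_seconds: float = 0.0   # time spent without moving
    queued: bool = False        # True once tagged for high resolution imaging

def update(obj, new_center, new_size, dt, still_tolerance=1.0):
    """Update one object with its match in the current frame.

    dt is the time since the previous frame in seconds. still_tolerance
    is a hypothetical cut-off, in pixels moved between frames, below
    which the object counts as motionless. Returns True only on the
    update that queues the object for imaging.
    """
    dist = math.hypot(new_center[0] - obj.center[0],
                      new_center[1] - obj.center[1])
    obj.speed = dist / dt
    obj.center, obj.size = new_center, new_size
    if dist <= still_tolerance:
        obj.idle_seconds += dt
    else:
        obj.idle_seconds = 0.0   # any movement restarts the idle timer
    if obj.idle_seconds >= IDLE_THRESHOLD_S and not obj.queued:
        obj.queued = True
        return True
    return False
```

Resetting the timer on any movement reflects the behavior described in the text: an object is queued only if it stays idle for the whole threshold time.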


Figure  11:  Foreground  Object  Tracking 

In  testing,  it  was  shown  that  people  standing  still  were  rarely  detected  as  “idle”  because 
even if a person is not moving, they are seldom, if ever, perfectly still. Abandoned luggage, on
the other hand, was detected with a great degree of reliability, because unlike people, it is
perfectly  motionless  for  significant  amounts  of  time.  When  an  object  is  queued  for  high 
resolution  imaging,  its  size  is  used  to  determine  the  zoom  applied  to  the  high  resolution  camera, 
and  its  coordinates  are  passed  to  the  pan/tilt  mount  to  point  the  camera.  An  example  of  a  frame 
from  the  object  tracking  process  is  shown  in  Figure  12. 




Figure  12:  Example  of  Object  Tracking 

Upper  Left:  Visual  representation  of  object  tracking,  non-idle  objects  shown  as  green  boxes. 
Upper  Right:  Cutout  showing  foreground  segmented. 

Lower  Left:  Input  video  frame. 

Lower Right: Foreground mask.


7.  Testing 

The  first  testing  conducted  was  of  the  background  creation  algorithm,  using  previously 
recorded video. Five test scenario movie clips were used, three from the PETS 2006 database [4]
and two from video of public areas recorded at the United States Naval Academy. Each of these
test scenarios was processed 10 times, with the first test conducted starting at the beginning of
the clip, the second at 10 seconds, the third at 20 seconds, and so on. The first test was used to
determine the time it took to calculate the composite background image from the motionless
images using either an averaging or median filter. As can be seen in Figure 13, the median
algorithm took longer in all five scenarios.

[Plot: “Time To Create Background Frame for Different Scenarios”. Horizontal axis: Test Scenario. Vertical axis scale: 0.0000 to 7.0000. Series: Averaging Algorithm (♦), Median Algorithm (X).]

Figure  13:  Time  to  Create  Composite  Background  Images 

The next tests focused on the accuracy of the background images created in the previous
testing. Because this research used pre-recorded clips, it was possible to obtain ideal background
images; each clip contained at least one frame in which no foreground objects were present in
the field of view. Using these frames as references, the accuracy of each composite background
image was measured by taking the absolute error of each pixel relative to the reference frame
and averaging over all pixels. This accuracy is displayed in Figure 14, which shows the
average error for the median and averaging algorithms in the five scenarios. It is readily apparent
that the median algorithm is more accurate in all cases.
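The error measure described above can be sketched as follows. This is an illustrative Python version, not the code used in the testing; the function name and the tiny 2 x 2 images are assumptions made for the example.

```python
def mean_absolute_error(composite, reference):
    """Average absolute gray-level difference between two images.

    Both images are lists of rows of integer gray levels and must have
    the same dimensions.
    """
    if len(composite) != len(reference):
        raise ValueError("images must have the same number of rows")
    total = 0
    count = 0
    for row_c, row_r in zip(composite, reference):
        if len(row_c) != len(row_r):
            raise ValueError("rows must have the same length")
        for c, r in zip(row_c, row_r):
            total += abs(c - r)
            count += 1
    if count == 0:
        raise ValueError("images must not be empty")
    return total / count


# Hypothetical reference frame (empty scene) and composite background.
reference = [[100, 100], [50, 50]]
composite = [[102, 100], [47, 51]]
print(mean_absolute_error(composite, reference))  # (2 + 0 + 3 + 1) / 4 = 1.5
```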


[Plot: Pixel Value Error For Different Test Scenarios. Horizontal axis: Test Scenario.
Vertical axis: pixel value error. Series: Averaging Algorithm, Median Algorithm.]

Figure 14: Accuracy of Background Creation Algorithms


The  final  stage  of  testing  involved  the  complete  system  in  staged  scenarios.  Two  testing 
sessions  were  conducted  at  the  United  States  Naval  Academy,  the  first  in  a  lecture  hall  (Rickover 
103)  and  the  second  in  an  auditorium  (Mahan  Hall  Auditorium).  The  purpose  of  the  test  in  the 
lecture  hall  was  to  determine  performance  in  poor  lighting  conditions.  The  auditorium  test  was 
done  to  show  performance  in  good  lighting  conditions,  which  were  achieved  by  using  stage 
lights. 

Both testing sessions followed the same general procedure. The system was set up and
subjects were instructed to enact several different scenarios. These scenarios ranged from a
crowd of people walking back and forth, with no baggage abandoned, to having everyone in the
crowd abandon some type of baggage. Testing results are shown in Figure 15. The quantities
measured were “Correct Detection Rate,” “Average Frames to Acquire Object(s) as Idle,” and
“False Detections” per test run. The “Correct Detection Rate” is the percentage of
abandoned items the system successfully captured in a high resolution image. “Average Frames
to Acquire Object(s) as Idle” measures how long it took the system to identify
that an object had been abandoned (the system ran at 15 frames per second). “False Detections”
are cases in which the system queued a foreground object for high resolution imaging when it was in
fact not abandoned luggage.
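As a worked illustration of these three quantities, the short Python sketch below computes each one. The function names and the counts passed to them are hypothetical; only the 15 frames per second figure comes from the text.

```python
FRAME_RATE = 15.0  # frames per second, as stated for the tested system


def correct_detection_rate(items_imaged, items_abandoned):
    """Percentage of abandoned items captured in a high resolution image."""
    return 100.0 * items_imaged / items_abandoned


def frames_to_seconds(frames, fps=FRAME_RATE):
    """Convert a frame count into elapsed time in seconds."""
    return frames / fps


def false_detections_per_run(total_false_detections, runs):
    """Average number of false detections per test run."""
    return total_false_detections / runs


# Hypothetical counts: 5 of 8 dropped items imaged, an object flagged as
# idle after 12 frames, and 1 false detection over 2 runs.
print(correct_detection_rate(5, 8))    # 62.5
print(frames_to_seconds(12))           # 0.8
print(false_detections_per_run(1, 2))  # 0.5
```

At 15 frames per second, an average of 12 frames corresponds to 0.8 seconds between an object becoming stationary and the system flagging it as idle.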


LOW LIGHT CONDITIONS

Scenario                        Number of  Correct    Average Frames to  False Detections
                                Tests Run  Detection  Acquire Object(s)  per Test Run
                                           Rate       as Idle
Nothing Idle, People Wandering  2          N/A        N/A                1
1 Person, 1 Drop                2          100.00%    12                 0
Multiple People, 1 Drop         1          100.00%    10                 0
2 People, 2 Drops               1          100.00%    14.5               0
Multiple People, 2 Drops        1          100.00%    15.75              1
3 People, 3 Drops               1          100.00%    17                 0
Multiple People, 3 Drops        1          66.66%     17.25              1
Multiple People, 8 Drops        1          62.50%     19                 1

GOOD LIGHT CONDITIONS

Scenario                        Number of  Correct    Average Frames to  False Detections
                                Tests Run  Detection  Acquire Object(s)  per Test Run
                                           Rate       as Idle
Nothing Idle, People Wandering  2          N/A        N/A                0.5
1 Person, 1 Drop                3          100.00%    8                  0
Multiple People, 1 Drop         3          100.00%    9                  0
2 People, 2 Drops               2          100.00%    12                 0
Multiple People, 2 Drops        2          100.00%    14                 0.5
3 People, 3 Drops               3          100.00%    15.33              0
Multiple People, 3 Drops        2          100.00%    14                 0.5
Multiple People, 8 Drops        2          87.50%     16                 0.5

Figure 15: Integrated System Testing Results



At  the  same  time,  the  high  resolution  images  obtained  were  visually  compared  to 
determine  the  effect  of  lighting  conditions  on  image  quality.  A  sample  of  this  comparison  is 
shown  in  Figure  16  and  Figure  17.  For  reference,  these  photographs  were  taken  from  a  distance 
of  approximately  50  feet.  As  can  be  seen,  the  low  lighting  hampered  the  camera’s  ability  to  focus 
accurately.  Some  of  the  blurring  apparent  in  Figure  16  is  also  due  to  the  slower  shutter  speed 
needed  because  of  the  poor  lighting. 


Figure  16:  High  Resolution,  Poor  Lighting  Conditions 


Figure  17:  High  Resolution,  Good  Lighting  Conditions 


8.  Conclusions 




One  of  the  primary  difficulties  in  working  with  an  automated  security  system  in  the  real 
world  is  that  variations  in  the  scenery  can  cause  significant  errors.  Creating  an  algorithm  that  is 
able  to  adapt  itself  to  any  environment  presented  to  it  is  one  of  the  biggest  hurdles  to  overcome 
in  order  to  create  a  reliable  autonomous  security  system.  This  research  successfully  accomplishes 
this  with  the  background  creation  algorithm.  The  system’s  ability  to  rapidly  create  an  accurate 
representation  of  an  “empty”  scene  allows  deployment  at  any  location  and  immediate  use. 
Additionally,  the  use  of  temporal  differencing  in  the  foreground  segmentation  algorithm  proved 
extremely  successful  for  creating  an  accurate  foreground  mask. 

The  high  resolution  camera  system  proved  to  be  dependable  and  effective  in  testing, 
allowing  for  distant  objects  that  were  barely  visible  in  the  low  resolution  video  to  be 
photographed  in  great  detail.  While  the  objects  could  not  usually  be  identified  in  the  low 
resolution  (webcam)  video  feed,  the  high  resolution  photos  could  allow  a  human  supervisor  to 
determine  whether  or  not  they  consist  of  abandoned  luggage  even  at  long  range.  A  surveillance 
system  could  easily  be  set  up  using  any  number  of  the  high/low  resolution  camera  systems, 
covering  a  large  area  (such  as  an  entire  airport  or  train  station)  and  monitored  by  only  a  single 
supervisor,  since  they  would  only  need  to  occasionally  examine  the  high  resolution  photographs 
flagged  as  security  risks,  rather  than  continuously  viewing  surveillance  videos. 

While  the  images  taken  under  poor  lighting  conditions  were  somewhat  out  of  focus  and 
dark,  they  were  still  clear  enough  for  an  operator  to  positively  identify  the  object  being 
photographed.  The  images  taken  in  good  lighting  conditions  on  the  other  hand  have  excellent 
resolution,  as  can  be  seen  in  Figure  17.  This  photograph,  taken  from  a  distance  of  approximately 
fifty feet, actually allows a human operator to read text on the object. For comparison purposes,
the same object as seen in the low resolution video feed from the webcam was nothing more
than  a  fuzzy  spot,  barely  ten  pixels  across.  The  problems  seen  in  the  high  resolution  images  that 
were  caused  by  poor  lighting  could  be  solved  by  using  a  true  single-lens  reflex  (SLR)  camera, 
which,  due  to  the  quality  of  its  optics,  would  have  better  low-light  imaging  capability. 

System errors in testing consisted of false detections and missed detections, and they arose
from two sources. The first was system saturation, which occurred
in the scenarios involving eight people abandoning luggage. The system was simply unable to
keep up with the demand of tracking so many people and abandoned items. As
can be seen in the results table of Figure 15, system saturation occurred with fewer test subjects
and had a more detrimental effect under low light conditions. In practical applications of
the system, this saturation is not a significant issue, as it is highly unlikely that one would ever
encounter a scenario where eight or more bags were being abandoned simultaneously.

The second type of error, false detection, is more significant. False detection errors were
primarily caused by large fluctuations in background lighting. They were seen mostly in the
testing done under poor lighting conditions, because when there is little light to
begin with, any variation is perceived as significant. While the algorithms used were
designed with this in mind, and were for the most part successful, some errors did occur. These
errors occasionally caused the foreground segmentation algorithm to detect objects that
were not actually present, so the system occasionally took high resolution images
when it was not necessary. This issue could be addressed in future work by having the system
update the composite background image more often, or by having it further reduce lighting noise
by searching for periodic variations.



Because  of  the  versatile  nature  of  the  high/low  resolution  camera  combination,  future 
work  with  this  system  is  likely.  Issues  to  be  addressed  include  obtaining  more  computing  power 
to  allow  for  more  complex  algorithms  that  still  run  in  real-time,  and  implementation  of  several 
security  camera  systems  running  together  over  a  network.  Additionally,  this  system  could  be 
evolved  to  a  biometric  identification  system,  which  looks  for  faces  and  performs  high  resolution 
facial  recognition.  Finally,  the  algorithms  could  easily  be  expanded  to  look  for  suspicious  events 
besides  abandoned  luggage,  such  as  the  presence  of  weapons,  people  running,  or  fights. 


9.  References 




[1] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, “Pfinder: Real-time tracking of the
human body,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, pp.
780-785, July 1997.

[2] N. Friedman and S. Russell, “Image segmentation in video sequences: A probabilistic
approach,” Proceedings of the Thirteenth Annual Conference on Uncertainty in Artificial
Intelligence (UAI-97), pp. 175-181, Morgan Kaufmann Publishers, Inc., San Francisco,
CA, 1997.

[3] N. Otsu, “A Threshold Selection Method from Gray-Level Histograms,” IEEE
Transactions on Systems, Man, and Cybernetics, Vol. 9, No. 1, pp. 62-66, 1979.

[4] Performance Evaluation of Tracking and Surveillance Conference 2006, “PETS 2006
Benchmark Data,” http://www.cvg.rdg.ac.uk/PETS2006/data.html.

[5] R. Gonzalez, R. Woods, and S. Eddins, Digital Image Processing
Using MATLAB. Upper Saddle River, NJ: Pearson Prentice Hall, 2004.


10.  Works  Consulted 




1.  Q.  Zhou  and  J.  Aggarwal,  “Tracking  and  classifying  moving  objects  from  videos," 
Proceedings  of  IEEE  Workshop  on  Performance  Evaluation  of  Tracking  and 
Surveillance,  2001. 

2. L. Li, W. Huang, I. Yu-Hua Gu, and Q. Tian, “Statistical Modeling of Complex
Backgrounds for Foreground Object Detection,” IEEE Transactions on Image
Processing, Vol. 13, No. 11, pp. 1459-1472, 2004.

3.  J.  Papon,  R.  Broussard  and  R.W.  Ives,  “Dual  Camera  System  for  Acquisition  of  High 
Resolution  Images,”  Proceedings  of  the  SPIE,  Vol.  6496,  pp.  29-37,  2007. 

4.  B.D.  Ripley,  Pattern  Recognition  and  Neural  Networks.  Cambridge:  Cambridge 
University  Press,  1996. 

5.  T.  Masters,  Signal  and  Image  Processing  with  Neural  Networks,  New  York:  Wiley,  1994. 

6. R. Duda, P. Hart and D. Stork. Pattern Classification, 2nd Ed., New York: John Wiley and
Sons, 2001.

7.  Tekalp,  Murat.  Digital  Video  Processing.  Upper  Saddle  River,  NJ:  Prentice  Hall,  1995. 

8.  R.  Jain,  R.  Kasturi,  and  B.  Schunck.  Machine  Vision.  Boston:  McGraw  Hill,  1997. 

9.  S.  Cheung  and  C.  Kamath,  “Robust  techniques  for  background  subtraction  in  urban 
traffic  video,”  Proceedings  of  SPIE  Visual  Communications  and  Image  Processing,  Vol. 
5308,  pp.  881-892,  2004. 

10. L. Tian, M. Lu, and A. Hampapur, “Robust and efficient foreground analysis for real-time
video surveillance,” Proceedings of IEEE Conference on Computer Vision and
Pattern Recognition, Vol. 1, pp. 1182-1187, 2005.




Appendix  A:  Glossary 

Complement  -  A  binary  operation  which  replaces  all  “true”  values  with  “false”  values  and  vice- 
versa. 

Dilation  -  A  morphological  operation  which  expands  the  size  of  “true”  regions. 

Graphical User Interface (GUI) - An interface which contains graphical elements for
easy end-user control of a computer algorithm or system.

Majority  Filter  -  A  type  of  filter  that  processes  pixels  in  a  region  of  an  image  and  sets  the 
center  pixel  of  the  region  to  the  dominant  value. 

Mask  -  A  type  of  binary  image  consisting  of  only  logical  values  of  true  or  false.  In  image 
processing,  masks  are  commonly  used  to  eliminate  unwanted  areas  from  images  (for  example, 
regions  with  false  values  are  removed). 

MATLAB - A numerical computing environment and programming language created by The
MathWorks.

Mega-Pixel - A mega-pixel is 1 million pixels, and is a term used not only for the number of
pixels in an image, but also to express the number of sensor elements of digital cameras.

MEX-File - MEX stands for MATLAB executable. MEX-files are dynamically linked
subroutines produced from C/C++ or Fortran source code that, when compiled, can be run from
within MATLAB in the same way as MATLAB files or built-in functions.

Morphology  -  A  collection  of  techniques  for  digital  image  processing  based  on  mathematical 
morphology.  Since  these  techniques  rely  only  on  the  relative  ordering  of  pixel  values,  not  on 
their  numerical  values,  they  are  especially  suited  to  the  processing  of  binary  images  and 
grayscale  images  whose  light  transfer  function  is  not  known. 



Packet  -  In  digital  communications,  a  packet  is  a  formatted  block  of  information  transmitted 
and/or  received  on  a  computer  network. 

RS-232  -  A  standard  for  serial  binary  data  commonly  used  in  computer  serial  ports. 

Slew  Rate  -  A  measurement  used  for  mechanical  devices  which  move  in  a  circle.  It  refers  to  the 
number  of  degrees  the  device  turns  in  one  second. 

Software Development Kit (SDK) - A set of development tools that allows a software engineer
to create applications for a certain software package, software framework, or hardware platform.

Universal Serial Bus (USB) - A serial bus standard used to connect hardware devices to a
computer.


Appendix  B:  Graphical  User  Interface 



The  MATLAB  Graphical  User  Interface  (GUI)  was  created  specifically  for  this 
research,  and  allows  the  user  to  control  all  aspects  of  the  system.  This  appendix  details  the 
various  functions  of  the  GUI. 

Initialize Button - This button initializes the controls for the various parts of the system. It
sets up the serial port connection to the pan/tilt mount, opens a USB connection and turns
on the high resolution camera, and opens a USB connection and turns on the webcam. If
any of these components is not present, it notifies the user of an error.

Clean  Up  Button  -  This  button  disconnects  the  various  components  of  the  system  (in  software). 
It  can  be  used  if  an  error  occurs  to  reinitialize  the  system,  and  is  done  automatically  when 
the  program  exits. 

Pan/Tilt  Configuration  /  Control 

Max  Speed  -  This  defines  the  maximum  speed  the  pan/tilt  mount  will  attain  as  it  moves. 
It  is  defined  in  degrees/second,  and  can  be  set  for  values  from  1-400. 

Acceleration  -  This  defines  the  maximum  acceleration  rate  the  pan/tilt  mount  will 
achieve  as  it  moves.  It  can  be  set  for  values  from  0.1  to  200  degrees  per  second  squared. 
Note:  Setting  this  to  a  value  of  over  130  causes  significant  jarring  of  the  high  resolution 
camera,  especially  for  small  movements,  and  is  not  recommended. 

Base  Speed  -  This  defines  the  base  speed  of  the  pan/tilt  mount,  and  should  be  set  to  a 
value  of  10  degrees/second.  Anything  higher  than  this  will  cause  significant  jarring  on 
small  movements  and  could  damage  the  high  resolution  camera. 

Query  Speeds  -  This  queries  the  pan/tilt  controller  for  the  different  speeds  listed  above, 
and  fills  the  text  boxes  with  the  appropriate  values. 



Set  Speeds  -  This  sends  the  speeds  shown  in  the  text  boxes  above  it  to  the  pan/tilt 
controller.  It  will  issue  a  warning  if  dangerous  values  are  entered,  and  an  error  if  invalid 
ones  are  entered. 

Pan/Tilt  Motor  Response  -  This  dialog  box  shows  all  the  commands  sent  to  the  pan/tilt 
controller,  and  its  machine  language  responses.  It  can  be  used  to  make  sure  that  any 
commands  sent  to  the  mount  have  been  successfully  received. 

Pan/Tilt Manual Control - These dialog boxes allow the user to manually control the pan
and tilt of the mount. Valid values for pan (left/right) are from -2800 to 3000, and from
-1500 to 1500 for tilt (up/down).
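The limits listed in this section can be collected into a simple range check. The following Python sketch is only an illustration of the warning and error behavior described for Set Speeds and Manual Control; it is not the GUI's actual code, and the function names and message strings are hypothetical. No lower bound is given for the base speed, so only the recommended ceiling of 10 degrees/second is checked.

```python
def check_speed_settings(max_speed, acceleration, base_speed):
    """Return (errors, warnings) for the three speed fields.

    Errors mark values outside the documented valid ranges; warnings mark
    values that are valid but described as risky for the camera.
    """
    errors, warnings = [], []
    if not 1 <= max_speed <= 400:
        errors.append("max speed must be between 1 and 400 degrees/second")
    if not 0.1 <= acceleration <= 200:
        errors.append("acceleration must be between 0.1 and 200 degrees/second^2")
    elif acceleration > 130:
        warnings.append("acceleration above 130 may jar the camera")
    if base_speed > 10:
        warnings.append("base speed above 10 degrees/second may jar the camera")
    return errors, warnings


def check_position(pan, tilt):
    """Return a list of errors for a manual pan/tilt position request."""
    errors = []
    if not -2800 <= pan <= 3000:
        errors.append("pan must be between -2800 and 3000")
    if not -1500 <= tilt <= 1500:
        errors.append("tilt must be between -1500 and 1500")
    return errors


print(check_speed_settings(300, 150, 10))  # valid, with one warning
print(check_position(3100, 0))             # pan is out of range
```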

Canon  Camera  Configuration  /  Control 

Image Size - This defines the size of the image taken by the high resolution camera. The
three available options are Large (2800 x 2400), Medium (1920 x 1680), and Small
(640 x 480).

Auto  Shutter  Release  -  This  allows  the  algorithms  to  automatically  move  and  capture 
images  of  abandoned  luggage.  As  a  precaution,  ensure  that  nothing  is  touching  the 
pan/tilt  mount  when  this  option  is  enabled,  and  that  there  are  no  wires  that  might  get 
caught  when  the  mount  moves  (It  can  easily  cut  through  them  if  the  acceleration  and 
maximum  speed  values  are  set  on  the  high  end). 

Shutter  Release  -  This  manually  releases  the  high  resolution  camera  shutter,  and 
downloads  the  image  acquired  to  the  GUI. 

Zoom  -  This  is  a  manual  control  for  the  high  resolution  camera  zoom,  and  should  not  be 
tampered  with  while  Auto  Shutter  Release  is  enabled. 



Show  Last  Still  Shot  -  This  displays  the  last  image  acquired  by  the  high  resolution 
camera  in  a  window  in  the  GUI. 

Open  Webcam  Preview  Button  -  This  opens  a  window  which  shows  a  live  video  feed  from  the 
webcam.  This  button  should  be  pressed  before  Auto  Shutter  Release  is  enabled. 

Start  Difference  Image  Button  -  This  button  starts  the  foreground  segmentation  and 
background  tracking  algorithms.  Additionally,  it  opens  a  window  which  shows  a  visual 
representation  of  what  the  tracking  algorithm  is  detecting. 


Appendix  C:  MATLAB  Code 




The  following  appendix  contains  listings  of  the  MATLAB  code  used  in  the  final 
algorithms.  Code  for  the  Graphical  User  Interface  is  not  included  for  brevity’s  sake,  but  can  be 
obtained  by  contacting  the  author. 

Buildbackgroundgraymedian.m  -  This  is  the  function  to  create  the  composite  background 
image. 


%Composite Background Creation
%Jeremie Papon, 2007

function backimg = buildbackgroundgraymedian(file)

%Open the first video frame to get size information about it and
%allocate the appropriate amount of memory
pic1 = imread(file(1).name);
numframes = 100;    %Number of frame pairs analyzed
backimg = uint8(zeros(size(pic1,1), size(pic1,2)));
%One slice per analyzed frame, so that unused all-zero slices are never
%counted as motionless samples
picarray = zeros(size(pic1,1), size(pic1,2), numframes, 'int16');

for picindex = 1:numframes
    %Open the video frames to analyze them
    pic1 = cvcolor_mex(imread(file(picindex).name), 'rgb2gray');
    pic2 = cvcolor_mex(imread(file(picindex+1).name), 'rgb2gray');

    %Calculate the Difference Mask
    diffimg = imabsdiff(pic1, pic2);

    %Threshold the Difference Mask (graythresh returns the Otsu level,
    %im2bw applies it)
    diffimg = im2bw(diffimg, graythresh(diffimg));

    %Do a majority Operation to reduce noise from temporal differences
    diffimg = bwmorph(diffimg, 'majority');

    %Dilate the Image to increase areas of movement
    diffimg = cvlib_mex('dilate', diffimg, 6);

    %Multiply by -2 so all moving areas are negative
    tempimg = int16(diffimg ~= 0) * (-2);

    %Add one so movement is -1 and no movement is 1
    tempimg = tempimg + 1;

    %Multiply Difference mask by video frame
    temp = int16(pic1) .* tempimg;

    %Add the result into the motionless structure
    picarray(:,:,picindex) = temp;
end

%Cycle through each (x,y) pixel location
for x = 1:size(picarray,1)
    for y = 1:size(picarray,2)
        %Find all values for this location that were motionless
        tempvals = picarray(x, y, find(picarray(x,y,:) >= 0));

        %Set the background value for this location to the median of
        %all motionless values for this location
        if ~isempty(tempvals)
            backimg(x,y) = uint8(median(double(tempvals(:))));
        else
            backimg(x,y) = 0;
        end
    end
end


Subwaydemo.m  -  This  file  shows  a  demo  of  the  foreground  segmentation  and  object  tracking 
algorithms. 

%Foreground Segmentation and Object Tracking Demo
%Created by Jeremie Papon, 2007

%Initialize Tracking Structure
objectlist = struct('XbyY',{},'WbyH',{},'idleframes',{},'speed',{},'color',{});

%Set Various Thresholds
SPEEDTHR = 2.0;
DISTTHR = 40;
SIZETHR = 30;

%Initialize Morphological Mask
SE = strel('square', 10);

%Open the PETS2006 Data Set and Read it into memory
dos('dir /B c:\papon\SubwayData\s7-t6-b\video\pets2006\s7-t6-b\3\*.jpeg > filelist.txt');

fid = fopen('filelist.txt');
index = 1;
while ~feof(fid)
    file(index).name = ['c:\papon\SubwayData\s7-t6-b\video\pets2006\s7-t6-b\3\' fgetl(fid)];
    index = index + 1;
end
fclose(fid);
index = index - 1;

%Call Composite Background Creation Function
backim = buildbackgroundgraymedian(file);
background = backim;
background = imresize(background, .5);

%Read in first frame, Convert to Grayscale, and Reduce it in Size
newframecolor = imread(file(1).name);
newframe = cvcolor_mex(newframecolor, 'rgb2gray');
newframe = imresize(newframe, .5);

%Create the "old" background difference mask
oldbackdiff = imabsdiff(background, newframe) > 20;




for n = 2:numel(file)
    %Clear the Deletion Array for Tracked Objects
    deletearray = true(1, numel(objectlist));

    %Read in the next frame, convert to grayscale, and resize it
    newframecolor = imread(file(n).name);
    newframe = cvcolor_mex(newframecolor, 'rgb2gray');
    newframe = imresize(newframe, .5);

    %Create the "new" background difference mask
    newbackdiff = imabsdiff(background, newframe) > 20;

    %Determine where the two background differences are both 1 (an AND)
    backdiff = oldbackdiff & newbackdiff;

    %Set the "old" background difference to the new one
    oldbackdiff = newbackdiff;

    %Perform a Majority Operation on the difference mask
    backdiff = bwmorph(backdiff, 'majority');

    %Perform the morphological closing operation on the difference mask
    backdiff = imclose(backdiff, SE);

    %Create a label matrix for the foreground mask
    labelmat = bwlabel(backdiff, 4);

    %Find the contiguous regions in the mask
    stats = regionprops(labelmat, 'BoundingBox');

    %This Loop Cycles through all of the found objects
    for boxes = 1:size(stats,1)
        %Read location and size information
        Xloc = stats(boxes).BoundingBox(1);
        Yloc = stats(boxes).BoundingBox(2);
        Width = stats(boxes).BoundingBox(3);
        Height = stats(boxes).BoundingBox(4);
        newobject = 1;

        %This cycles through all old objects
        for index = 1:size(objectlist,2)
            Xdist = abs(Xloc - objectlist(index).XbyY(1));
            Ydist = abs(Yloc - objectlist(index).XbyY(2));
            Widthdiff = abs(Width - objectlist(index).WbyH(1));
            Heightdiff = abs(Height - objectlist(index).WbyH(2));

            %Find if the current object is the same as an old object
            if ((Xdist + Ydist) < DISTTHR && (Widthdiff + Heightdiff) < SIZETHR)
                objectlist(index).WbyH = [Width Height];
                objectlist(index).XbyY = [Xloc Yloc];
                objectlist(index).speed = (objectlist(index).speed + ...
                    sqrt(Xdist^2 + Ydist^2))/2;

                %If it is an old object, Check to see if idle
                if (objectlist(index).speed < SPEEDTHR)
                    objectlist(index).idleframes = objectlist(index).idleframes + 1;
                    if (objectlist(index).idleframes > 15 && ...
                            objectlist(index).idleframes < 25)
                        objectlist(index).color = 'y';
                    elseif (objectlist(index).idleframes > 24)
                        objectlist(index).color = 'r';
                    end
                else
                    objectlist(index).idleframes = 0;
                    objectlist(index).color = 'g';
                end

                %Dont Delete This Object
                deletearray(index) = 0;

                %Not a new Object
                newobject = 0;
                break;
            end
        end

        %Create a New Object if no old objects matched this one
        if (newobject)
            newobj = size(objectlist,2) + 1;
            objectlist(newobj).XbyY = [Xloc Yloc];
            objectlist(newobj).WbyH = [Width Height];
            objectlist(newobj).idleframes = 0;
            objectlist(newobj).speed = 0.1;
            objectlist(newobj).color = 'g';
            deletearray(newobj) = 0;
        end
    end

    %Delete all objects from the object list that were not seen this
    %iteration
    objectlist(deletearray) = [];

    %This Creates the four paned movie output showing the various parts
    %of the algorithms
    rows = size(background,1);
    cols = size(background,2);
    movieimg = uint8(zeros(rows*2, cols*2));
    movieimg(1:rows, 1:cols) = background;
    movieimg(1:rows, cols+1:cols*2) = uint8(oldbackdiff)*255;
    movieimg(rows+1:rows*2, cols+1:cols*2) = uint8(backdiff)*255;
    movieimg(rows+1:rows*2, 1:cols) = newframe;

    imshow(movieimg);
    for index = 1:size(objectlist,2)
        rectangle('Position', [objectlist(index).XbyY(1), objectlist(index).XbyY(2), ...
            objectlist(index).WbyH(1), objectlist(index).WbyH(2)], ...
            'FaceColor', objectlist(index).color, 'EdgeColor', objectlist(index).color)
        text(objectlist(index).XbyY(1) + floor(objectlist(index).WbyH(1)/3), ...
            objectlist(index).XbyY(2) + floor(objectlist(index).WbyH(2)/3), ...
            num2str(objectlist(index).idleframes))
    end
end
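For readers without access to MATLAB, the matching and idle-counting logic of the demo above can be sketched in Python. This is an illustrative port under stated assumptions, not the code that was tested: the dictionary-based track records and the function name update_tracks are hypothetical, while the threshold values and the color rules are taken from the listing.

```python
import math

SPEEDTHR = 2.0  # smoothed speed below which an object counts as idle
DISTTHR = 40    # max |dx| + |dy| for a box to match a tracked object
SIZETHR = 30    # max |dw| + |dh| for a box to match a tracked object


def update_tracks(tracks, boxes):
    """Match this frame's bounding boxes against the tracked objects.

    tracks: list of dicts with keys x, y, w, h, speed, idleframes, color.
    boxes:  list of (x, y, w, h) tuples from the current foreground mask.
    Returns the updated list; tracks not matched in this frame are dropped.
    """
    seen = [False] * len(tracks)
    for (x, y, w, h) in boxes:
        matched = False
        for i, t in enumerate(tracks):
            dx, dy = abs(x - t["x"]), abs(y - t["y"])
            dw, dh = abs(w - t["w"]), abs(h - t["h"])
            if dx + dy < DISTTHR and dw + dh < SIZETHR:
                t.update(x=x, y=y, w=w, h=h)
                # Average the old speed with this frame's displacement.
                t["speed"] = (t["speed"] + math.hypot(dx, dy)) / 2
                if t["speed"] < SPEEDTHR:
                    t["idleframes"] += 1
                    if 15 < t["idleframes"] < 25:
                        t["color"] = "y"
                    elif t["idleframes"] > 24:
                        t["color"] = "r"
                else:
                    t["idleframes"] = 0
                    t["color"] = "g"
                seen[i] = True
                matched = True
                break
        if not matched:
            # No existing object matched, so start tracking a new one.
            tracks.append(dict(x=x, y=y, w=w, h=h, speed=0.1,
                               idleframes=0, color="g"))
            seen.append(True)
    return [t for t, s in zip(tracks, seen) if s]


# A box that never moves turns yellow and then red as idle frames accumulate.
tracks = []
for _ in range(17):
    tracks = update_tracks(tracks, [(10, 10, 20, 20)])
print(tracks[0]["idleframes"], tracks[0]["color"])
```

With the 15 frames per second rate reported in testing, the yellow state begins after slightly more than one second without movement.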


Appendix  D:  Papers  Published 




Two papers were published based on this research. The first was published in the
Proceedings of the 19th Annual SPIE/IS&T Symposium on Electronic Imaging. The second was
published in the Proceedings of the 21st National Conference on Undergraduate Research. Both
papers have been included here.




Dual Camera System for Acquisition of High Resolution Images

Jeremie A. Papon, Randy P. Broussard, Robert W. Ives
United States Naval Academy, Annapolis, MD 21412


ABSTRACT 

Video surveillance is ubiquitous in modern society, but surveillance cameras are severely limited by their low
resolution. With this in mind, we have developed a system that can autonomously take high resolution still frame
images of moving objects. In order to do this we combine a low resolution video camera and a high resolution still
frame camera mounted on a pan/tilt mount. In order to determine what should be photographed (objects of interest), we
employ a hierarchical method which first separates foreground from background using a temporal-based median
filtering technique. We then use a feed-forward neural network classifier on the foreground regions to determine
whether the regions contain the objects of interest. This is done over several frames, and a motion vector is deduced for
the object. The pan/tilt mount then focuses the high resolution camera on the next predicted location of the object, and
an image is acquired. All components are controlled through a single MATLAB graphical user interface (GUI). The
final system we present will be able to detect multiple moving objects simultaneously, track them, and acquire high
resolution images of them. Results will demonstrate performance tracking and imaging varying numbers of objects
moving at different speeds.

Keywords: surveillance, real-time video, biometrics, MATLAB

1 INTRODUCTION

Currently several cities and airports have implemented large scale security camera networks which seek to use
stationary cameras for security. The primary problem with these CCTV systems is the quality of the images they take:
security cameras are generally of low resolution, which greatly restricts the accuracy of any processing that might be
done on the images. This project seeks to resolve this problem by developing a dual camera system which contains a
low resolution web camera and a high resolution still-frame digital camera. The low resolution camera remains
stationary while the high resolution camera is mounted on a motorized pan/tilt mount. The low resolution video is used
for monitoring and detecting objects of interest using a MATLAB graphical user interface (GUI). This same GUI
controls the pan/tilt mount, which points the high resolution camera and then controls the capture of the still-frame
images.


2 DESCRIPTION OF COMPONENTS


2.1 Low resolution camera

The low resolution camera is a Logitech Quickcam Pro 5000 (Figure 1). This camera has proven ideal for the
purpose of surveillance because it combines good resolution (640 x 480 pixels), fast frame rates (30 frames per second),
and a wide field of view (82 degrees) in a small robust package. The Quickcam is stationary and surveys an entire area
of interest at once, continuously monitoring for movement. When movement in the field of view is detected, the camera
output is processed for object localization. The Quickcam interfaces directly with MATLAB via a USB cable, which
allows for quick processing of the video feed.

2.2 High resolution camera

The high resolution camera consists of a 6 mega-pixel Canon Powershot S3IS (Figure 2). This camera was
chosen because of its good low light performance and motorized 12 times optical zoom. The optical zoom is extremely
fast, and can be controlled remotely, which allows for imaging of small objects. The camera is mounted on a high speed
pan/tilt mount for pointing, and comes with a Software Development Kit (SDK) which allows for remote control of the
camera over a USB cable, including zoom and capture functions.


Figure 1: Quickcam Pro 5000


2.3 Pan/Tilt mount

The pan-tilt mount is a Directed Perception PTU-D46-17 which can pan a full 360 degrees and tilt 180 degrees,
at speeds of over 300 degrees a second. The mount allows for fast, stable, and accurate pointing of the high resolution
camera. The pan-tilt mount is controlled serially, and can be commanded directly from the GUI using the serial port
functionality built into MATLAB.


3 INTERFACING OF COMPONENTS

The first major hurdle in developing a system that combines various off-the-shelf components into a single
working unit is interfacing. Both the high resolution camera and the mount required separate controls to be developed in
order to meet their desired functionality. This was accomplished in MATLAB.

Control of the Pan-Tilt mount (Figure 3) was accomplished using the computer's serial port. The serial port is
used to send commands to the Pan-Tilt mount's controller, and allows control of both the movement of the mount and
all of its settings, such as slew rate and acceleration.


Figure 2: Canon Powershot S3IS




Figure 3: Directed Perception Pan-Tilt Mount


Controlling the Canon Powershot proved to be much more difficult, primarily because the Canon SDK, which
provides a basic framework for communicating with their cameras, was written in the C programming language. In
order to use the camera, all of the control functions had to be ported into MATLAB using MEX files. MEX files are a
specific type of MATLAB file which are written in C, but can compile and run in MATLAB. MEX files for initializing
the camera, changing camera settings, and controlling the shutter were created. But the actual transfer of images from
the high resolution camera to the computer over the USB (Universal Serial Bus) cable proved problematic, as the Canon
framework for doing this could not be ported into MATLAB. As such, new functions were written in C which
transferred the image data from the camera over the USB cable in individual packets. Packet sizes in initial testing were
limited to only 1024 bytes, which resulted in very poor transfer speeds for images (upwards of 5 seconds). After several
adjustments and revisions to the transfer protocol, 128 kilobyte packets were used successfully, which increased transfer
speeds substantially (approximately one second for a one megabyte image). The complete assembled system can be
seen in Figure 4.
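
The effect of the packet size can be illustrated with a little arithmetic. The following is only a sketch in Python, not the C transfer code written for this project: the function names are hypothetical, and the image size is assumed to be exactly one megabyte (1024 x 1024 bytes). It counts how many packets are needed at each packet size and shows how a buffer would be cut into packets.

```python
import math


def packet_count(image_bytes, packet_bytes):
    """Number of packets needed to move image_bytes in chunks of packet_bytes."""
    return math.ceil(image_bytes / packet_bytes)


def split_into_packets(data, packet_bytes):
    """Yield successive chunks of data, each at most packet_bytes long."""
    for start in range(0, len(data), packet_bytes):
        yield data[start:start + packet_bytes]


ONE_MEGABYTE = 1024 * 1024  # assumed image size for illustration

small_packets = packet_count(ONE_MEGABYTE, 1024)        # initial 1024-byte packets
large_packets = packet_count(ONE_MEGABYTE, 128 * 1024)  # final 128-kilobyte packets
```

Under these assumptions the same image needs 1024 small packets but only 8 large ones. If every packet carries a roughly fixed overhead, which is an assumption rather than a measured result, fewer packets means less total overhead, which is consistent with the speedup reported above.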


Figure 4: The System




4 OBJECT DETECTION


4.1 Foreground extraction

The first step in detecting motion in video is extracting regions of interest from the background. Two methods
were implemented to do this: motion segmentation and background subtraction. Both are able to extract the foreground
but differ somewhat in speed and accuracy.

4.1.1  Motion  segmentation 

Motion segmentation is implemented by finding the absolute difference between the current frame and a
previous frame and then thresholding the result. The equation for doing this is shown in equation (1), where pixel
$p_k(m,n,c)$ in frame $k$ is subtracted from the same pixel in the frame $l$ frames ago, and thresholded using a
threshold value $T$. Here, $m$ and $n$ represent the row and column numbers, and $c$ represents the three color
planes of red, green, and blue.

$$M_k(m,n) = \begin{cases} 1, & \sum_{c=1}^{3} \left| p_k(m,n,c) - p_{k-l}(m,n,c) \right| > T \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

After thresholding, morphological techniques are used. The morphology consists of erosion to eliminate noise,
and then dilation to expand the areas of motion. This is done because it was determined in testing that it is better to get
extra area surrounding detected motion in this step, rather than losing areas due to processing tolerances.
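
The differencing, thresholding, and morphology steps can be sketched as follows. This is a minimal pure-Python illustration on a toy 7 x 7 frame, not the MATLAB code used in this project. The function names, the 3 x 3 neighborhood for both erosion and dilation, and the threshold of 20 are all assumptions made for the example.

```python
def motion_mask(current, previous, threshold):
    """Threshold the summed absolute color difference between two frames.

    Frames are lists of rows; each pixel is an (r, g, b) tuple.
    """
    mask = []
    for cur_row, prev_row in zip(current, previous):
        row = []
        for cur_px, prev_px in zip(cur_row, prev_row):
            diff = sum(abs(a - b) for a, b in zip(cur_px, prev_px))
            row.append(1 if diff > threshold else 0)
        mask.append(row)
    return mask


def window(mask, m, n):
    """Return the 3x3 neighborhood of (m, n); off-image pixels count as 0."""
    rows, cols = len(mask), len(mask[0])
    return [mask[i][j] if 0 <= i < rows and 0 <= j < cols else 0
            for i in range(m - 1, m + 2)
            for j in range(n - 1, n + 2)]


def erode(mask):
    """Keep a pixel only if its whole 3x3 neighborhood is set (removes specks)."""
    return [[1 if all(window(mask, m, n)) else 0 for n in range(len(mask[0]))]
            for m in range(len(mask))]


def dilate(mask):
    """Set a pixel if anything in its 3x3 neighborhood is set (grows regions)."""
    return [[1 if any(window(mask, m, n)) else 0 for n in range(len(mask[0]))]
            for m in range(len(mask))]


# Toy scene: a 3x3 object appears, plus one noisy pixel at (5, 5).
previous = [[(10, 10, 10)] * 7 for _ in range(7)]
current = [list(row) for row in previous]
for m in range(1, 4):
    for n in range(1, 4):
        current[m][n] = (60, 60, 60)
current[5][5] = (40, 10, 10)

raw = motion_mask(current, previous, threshold=20)
cleaned = dilate(erode(raw))
```

In this toy case the erosion deletes the isolated noisy pixel and the dilation restores the object to its full 3 x 3 extent. A larger dilation element would give the "extra area" around detected motion that the text describes.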

While motion segmentation is computationally efficient and adaptable to changing backgrounds, testing
has shown that it has limitations. Variations in lighting, video noise, and stationary foreground objects can cause it
significant problems.

4.1.2 Background subtraction

Background subtraction, while more computationally intensive, can be more reliable. This technique requires
building a composite background, which can be done in many different ways, each of which has its own advantages
and disadvantages. The most common methods are Kalman filtering [1], mixture of Gaussians [2], and median
filtering [3]. For the purposes of this research, none of these techniques were computationally efficient enough to be
applied, so a new technique was developed that builds on these techniques. The base premise of the approach is that
pixel values for background will reoccur quite often over time, with slight variations due to lighting. As such, a
composite background is built where weights are assigned for the values of each pixel depending on frequency of
occurrence over time. These weights are adjusted in each frame according to equation (2), where $w_i$ specifies the
weight for intensity value $i$, $L$ is the learning rate, and $N$ is 1 for the new value and 0 for others.

$$w_i = w_i + L\,(N - w_i) \qquad (2)$$


As new values appear, they are added to the list of weights, and so over time, the most commonly occurring
values for each pixel will be given large weights. If a foreground object moves into the scene, its weight value will be
extremely low, and thus can be segmented.
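
The weight update for a single pixel can be sketched as below. This is an illustrative Python reading of the update rule, assuming the form w = w + L (N - w); it is not the project's MATLAB implementation. The function names, the learning rate of 0.1, and the foreground cutoff of 0.5 are assumptions, and values are matched exactly here, whereas the text allows for slight variations due to lighting.

```python
def update_weights(weights, observed, learning_rate):
    """Update one pixel's weight table after observing a new intensity value.

    weights maps an intensity value to its weight. N is 1 for the observed
    value and 0 for every other stored value.
    """
    if observed not in weights:
        weights[observed] = 0.0  # new values join the list with zero weight
    for value in weights:
        n = 1.0 if value == observed else 0.0
        weights[value] += learning_rate * (n - weights[value])
    return weights


def is_foreground(weights, observed, min_weight):
    """Treat a value as foreground when its learned weight is still small."""
    return weights.get(observed, 0.0) < min_weight


# One pixel that shows the background value 100 for twenty frames.
pixel_weights = {}
for _ in range(20):
    update_weights(pixel_weights, 100, learning_rate=0.1)

background_seen = is_foreground(pixel_weights, 100, min_weight=0.5)
intruder_seen = is_foreground(pixel_weights, 200, min_weight=0.5)
```

After twenty identical observations the background value has a weight close to 1, so it is not flagged, while a value that has never been seen has no weight and is flagged as foreground.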


4.2 Color segmentation


Once foreground regions are segmented, these regions are put through further processing to determine where
color values corresponding to the objects of interest are located. Detection is based on the principle that in certain color
planes, these values occupy a very narrow range. After extensive testing, the LUV color space was chosen because it
allowed for the most accurate segmentation, while still using integer values (which allows for a fast processing
technique that will be shown later). A pixel in the LUV color plane contains three values: one for luminance
(brightness), and two for chrominance (color).




In order to implement the color segmentation in a manner as efficient as possible, the number of comparisons to
determine if a pixel is, for example, skin must be kept as low as possible. This is necessary because every video frame
contains 640x480 pixels, or 307,200 pixels. A brute force implementation would require 6 comparisons per pixel, as one
needs to determine whether each of the three components of the color plane fall between a low and high threshold (that
is, whether a pixel's color is "close" to that of an object of interest), and then two "AND" operations on these three
binary values. These 8 comparisons per pixel amount to 2,457,600 comparisons per frame. This proved to be too slow,
since 30 frames per second were being processed, so a new faster approach was implemented. In this approach, the
range of values are placed in three data structures: one for each portion of the color space. A graphical representation of
these structures is seen in Figure 5. The structures contain a logical 1 or 0 for every possible L, U, and V value (integers
with value 0 to 255). These structures allow for much faster detections, as it only requires three quick "AND"
operations to determine if a pixel lies within the range of color values. This reduces the maximum number of
computations per frame from 2,457,600 to 921,600. In trials, this amounted to a reduction of about 05 seconds per
frame (a significant amount of processing time saved for real time video).
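
The lookup-table idea can be sketched in a few lines. The Python below is an illustration only: the function names are hypothetical and the L, U, and V ranges are example values, not necessarily the thresholds used in this research. Each table holds one logical entry per possible value, so classifying a pixel costs three table reads combined with AND.

```python
def build_table(low, high, size=256):
    """One logical entry per possible value: True inside [low, high]."""
    return [low <= value <= high for value in range(size)]


def make_classifier(l_range, u_range, v_range):
    """Precompute the three tables once, then classify pixels by lookup."""
    l_table = build_table(*l_range)
    u_table = build_table(*u_range)
    v_table = build_table(*v_range)

    def matches(pixel):
        l, u, v = pixel
        return l_table[l] and u_table[u] and v_table[v]

    return matches


# Example ranges (assumed for illustration).
matches = make_classifier(l_range=(140, 230), u_range=(100, 118), v_range=(140, 158))

inside = matches((150, 110, 150))
outside = matches((150, 90, 150))

# Operation counts quoted in the text for a 640 x 480 frame.
PIXELS_PER_FRAME = 640 * 480
brute_force_ops = PIXELS_PER_FRAME * 8
table_ops = PIXELS_PER_FRAME * 3
```

The tables are built once, when the color ranges are chosen, so the per-frame cost is dominated by the three lookups per pixel.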

[Figure 5: three lookup tables, one each for L, U, and V, indexed by value (0 to 255) and holding a logical 1 or 0. In
the example shown, the entries are 1 for L = 140 to 230, U = 100 to 118, and V = 140 to 158, and 0 elsewhere.]

Figure 5: Color Segmentation Data Structure

4.3 Neural network for classification

At this point the video frame has been reduced to a few windows which may contain the objects of
interest. Once these windows are extracted, they must be classified as either objects of interest or non-objects of interest.
To do this, a multilayer perceptron neural network classifier is employed. Neural networks, named after their
resemblance in functionality to the human nervous system, consist of large numbers of simple processing elements
called neurons, which exhibit complex global behavior based on their connections and individual neuron parameters.
Neural networks are used primarily to solve problems for which no mathematically optimal solution exists, such as
determining whether or not an image contains a type of object.




[Figure 6 diagram; recoverable labels: Teach/Use, Teaching Input, Output]

Figure 6: A Neuron


Each individual neuron in a neural network is simple in nature; it consists of a summation and a thresholding
function. An example of a neuron is shown in Figure 6. Each input X_i is multiplied by a weight W_i and these products
are summed in the neuron. This value is then thresholded by a sigmoid function, and the result, either a one or zero, is
sent to the neuron's output, which may be the next layer of nodes. The type of neural network employed in this research
is called a multilayer perceptron feed-forward network [4]. This type of neural network is characterized by two
principal elements: all of the inputs are connected to every hidden node, and information only goes forward (i.e. there
is no feedback). This type of network was chosen because of its speed; since there is no feedback, all computations can
occur simultaneously, allowing very fast classification. An example of a feed forward network similar to the one used
is displayed in Figure 7.
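
A single neuron and a small feed-forward pass can be sketched as follows. This Python is a generic textbook illustration, not the network code used in this research. The bias term, the toy weights, and the 0.5 cutoff used to turn the sigmoid output into a one or zero are assumptions made for the example; the layer sizes (four hidden neurons, two outputs) follow Figure 7.

```python
import math


def sigmoid(x):
    """Smooth threshold that maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs followed by the sigmoid."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(total)


def layer(inputs, weight_rows, biases):
    """Every neuron in a layer sees every input; there is no feedback."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


def feed_forward(inputs, hidden_w, hidden_b, output_w, output_b):
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)


# Toy network: 3 inputs, 4 hidden neurons, 2 output neurons.
hidden_w = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
hidden_b = [0.0, 0.0, 0.0, 0.0]
output_w = [[1, 1, 1, 1], [-1, -1, -1, -1]]
output_b = [0.0, 0.0]

outputs = feed_forward([1.0, 0.0, 1.0], hidden_w, hidden_b, output_w, output_b)
decisions = [1 if o >= 0.5 else 0 for o in outputs]
```

Because each neuron in a layer depends only on the previous layer's outputs, the neurons within one layer can be evaluated independently of each other.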

[Figure 7 diagram: an input layer (Input 1 through Input n), a hidden layer, and an output layer]

Figure 7: A Feed Forward Neural Network




Figure 7 shows a neural network that has n input values presented at a time to each of four neurons in a
"hidden" layer. The output of these hidden neurons is passed to each of two neurons in the output layer. As discussed
earlier, each neuron or "node" has weights which determine the output for any given set of inputs. In order to determine
the values of the weights, and thus get the correct output, the network must be trained. This is accomplished by building
a set of training data, which contains thousands of examples of objects of interest and non-interest. Objects of interest
are simply images of what we are trying to find. Objects of non-interest, on the other hand, can be, quite literally,
anything. As such, for the purpose of building this training data, random windows of images which did not contain
objects of interest were used. Finally, in order for the windows to be processed by the network code, they must be
changed from image matrices to one-dimensional vectors.

Once each window is classified (as of interest or non-interest) and transformed into a vector, a matrix is created
in which each row represents a window's vector, with the first column containing the correct classification. This matrix
is then used iteratively to find the best weights for the hidden nodes so that the neural network obtains the best
classification performance possible (since the correct classification is known during training). One of the key variables
in training is the number of hidden nodes; various numbers of nodes were tested in order to determine the number of
hidden nodes that gave the best accuracy. Six hidden nodes worked the best, giving fast computation times and over
96% accuracy on the training set.
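
The layout of the training matrix can be sketched as below. This Python is an illustration with hypothetical function names and toy 2 x 3 windows. It shows only how labeled windows become rows with the classification in the first column; it does not show the training procedure itself.

```python
def window_to_vector(window):
    """Flatten an image window (list of rows) into a one-dimensional vector."""
    return [pixel for row in window for pixel in row]


def build_training_matrix(labeled_windows):
    """One row per window: the known classification first, then the pixels."""
    return [[label] + window_to_vector(window) for label, window in labeled_windows]


# Two toy 2x3 windows: label 1 = object of interest, label 0 = non-interest.
object_window = [[200, 210, 205],
                 [198, 207, 203]]
random_window = [[12, 240, 77],
                 [90, 3, 151]]

training_matrix = build_training_matrix([(1, object_window), (0, random_window)])
```

Each row here has 1 + 6 entries. With a real window the row length would be one plus the number of pixels in the window, which is also the number of inputs the network must accept.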

One of the other key variables is the number of input values passed into the neural network for each
classification (that is, number of elements in each input vector). For this research, this amounted to the number of pixels
used for each window. Various window sizes were tested, ranging from as small as 6x6 up to 30x30. It was found that
increasing resolution past 12x8 actually resulted in a decrease in performance (both in efficiency and accuracy). While
at first this may seem counterintuitive, it is a common occurrence in neural networks. It is primarily due to the fact that
objects only have so many distinguishing features common to all of them. Once the resolution is increased beyond a
certain point, the extra information provided by the extra pixels is spurious at best, and often detrimental to
performance.


5  OVERALL  OPERATION 

Once a window is classified (manually) as an object of interest, pertinent information (centroid location, size, and
speed) about the window is stored in a data structure. Only location and size are stored in the initial detection, with
speed being stored after the same object is detected in four successive frames. Speed is determined by a weighted
average of differences between the centroids from successive frames. The equation for doing this is shown in equation
(3), where $n=1$ is the most recent frame, $\mathbf{c}_n$ is the centroid in frame $n$, and $a_n$ are the weighted
average constants.

$$\mathbf{v} = \sum_{n=1}^{3} a_n \left( \mathbf{c}_n - \mathbf{c}_{n+1} \right), \qquad a_n = [0.5,\ 0.3,\ 0.2] \qquad (3)$$

The weighted average allows for changes in speed, while preventing excessive spikes due to errors in window
centering. Once a speed is determined, the object is queued for capture. When the object reaches the beginning of the
queue, the pan-tilt mount is moved to the object's next location (as predicted by the object's speed), and a high
resolution image is taken. This method is particularly robust to error because a window is only queued for imaging if it
occurs four times in a row. For something other than an object of interest to be queued, it would have to be falsely
detected in four successive frames, which testing has shown is extremely unlikely. The entire process is run from one
MATLAB graphical user interface, which gives options for all possible settings for the mount and cameras. The GUI
also allows for manual capture of areas in the low resolution video frame.
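
The speed estimate and the predicted pointing location can be sketched as follows. This Python is an illustration under stated assumptions: the weights 0.5, 0.3, and 0.2 come from the text, but the function names, the (x, y) centroid representation, and the one-frame-ahead prediction are choices made for the example.

```python
WEIGHTS = (0.5, 0.3, 0.2)  # most recent difference first


def weighted_speed(centroids, weights=WEIGHTS):
    """Weighted average of successive centroid differences, in pixels per frame.

    centroids holds (x, y) pairs with the most recent frame first. Four
    detections are needed to form the three differences; otherwise None.
    """
    if len(centroids) < len(weights) + 1:
        return None
    vx = sum(w * (centroids[i][0] - centroids[i + 1][0]) for i, w in enumerate(weights))
    vy = sum(w * (centroids[i][1] - centroids[i + 1][1]) for i, w in enumerate(weights))
    return (vx, vy)


def predict_location(centroid, speed, frames_ahead=1):
    """Extrapolate the centroid along its estimated speed."""
    return (centroid[0] + speed[0] * frames_ahead,
            centroid[1] + speed[1] * frames_ahead)


# An object moving 10 pixels to the right in every frame.
history = [(40, 10), (30, 10), (20, 10), (10, 10)]
speed = weighted_speed(history)
target = predict_location(history[0], speed)

too_few = weighted_speed(history[:3])
```

With only three detections no speed is returned, which mirrors the rule that an object is queued for capture only after four successive detections.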




6 RESULTS

The system has undergone some preliminary testing that has proven its effectiveness for high resolution
surveillance. The system allows for almost any type of object recognition algorithm that incorporates video processing
to be inserted as the driving force for the high resolution camera. The system is virtually independent of the recognition
algorithm used and the end product is a high resolution image of objects of interest. Performance is dependent on a
number of factors, primarily how well the recognition algorithm employed is able to distinguish objects of interest, but
also the number and speed of the objects, to ensure they are not moving too fast for the servo motor to drive the high
resolution camera to capture an image.


7 CONCLUSIONS

Biometric identification, public surveillance, and unmanned vehicle systems all currently suffer from one common
handicap: their use of low resolution video. This severely restricts their ability to accurately achieve their respective
goals. The conclusion of this research is a functioning system that addresses this problem by allowing the accurate and
efficient capture of high resolution images of objects. The ability to detect any type of predefined object and the
autonomous nature of the end-system allow for simple application to all three of these areas, and more.

8 ACKNOWLEDGMENTS

The author gratefully acknowledges the Office of Naval Research for partial support of this work, via the
Naval Academy Trident Scholar Program, on funding document N0001406WR20137.

9  REFERENCES 

1. C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, "Pfinder: Real-time tracking of the human body," IEEE
Transactions on Pattern Analysis and Machine Intelligence 19, pp. 780-785, 1997.

2. N. Friedman and S. Russell, "Image segmentation in video sequences: A probabilistic approach," Proceedings of
the Thirteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-97), pp. 175-181, 1997.

3. Q. Zhou and J. Aggarwal, "Tracking and classifying moving objects from videos," Proceedings of IEEE
Workshop on Performance Evaluation of Tracking and Surveillance, 2001.

4. S. Haykin, Neural Networks: A Comprehensive Foundation, pp. 138-229, Macmillan College Publishing Co.:
New York, 1994.




Proceedings of The National Conference
On Undergraduate Research (NCUR) 2007
Dominican University of California
San Rafael, California
April 12 - 14, 2007

Autonomous Detection and Imaging of Abandoned Luggage

in  Real  World  Environments 


Jeremie Papon

Electrical  Engineering  Department 
United  States  Naval  Academy 
105 Maryland Avenue
Annapolis, MD 21402-5025, USA


Faculty Advisors: R. Broussard, R.W. Ives
Abstract 

This research developed a system that is able to detect and produce high resolution imagery of unattended items in a
crowded scene, such as an airport, using live video processing techniques. Video surveillance is commonplace in
today's public areas, but as the number of cameras increases, so do the human resources required to monitor them.
Additionally, current surveillance networks are restricted by the low resolution of their cameras. For example, while
there is an extensive security camera network in the London Underground, its low resolution prevented it from being
used to autonomously identify the terrorists that entered the train stations in July 2005. With this in mind, this
project developed a surveillance system that is able to autonomously monitor a scene for suspicious events by
combining a low resolution camera for surveillance (a webcam) with a moving high resolution camera (a 6 mega-pixel
digital still-frame camera) to provide a greater level of detail. This enhanced capability is used to determine
whether or not the event is a threat.

For the purposes of this research, suspicious events were defined as a person leaving a piece of luggage unattended
for an extended period of time, for example, as a terrorist might do when placing a bomb. Initial analysis of the
surveillance video involved separating the foreground (such as people carrying luggage) from the background. In
order to do this using live video, an automated algorithm was developed which creates a composite background
image from a small number of video frames. In the algorithm, areas detected as motion were subtracted out from
individual frames. These processed frames, which represented regions of no motion, were then combined by taking
the median value for each pixel across all frames. These median values were used to form a composite background
image which contained only the non-moving parts of the scene (i.e., the background).

Once this background was obtained, the system then detected live motion. Using a variety of filtering techniques,
individual foreground objects were separated from the stationary background. These objects were then tracked over
time. When an object (for the purposes of this research, a moving object was assumed to be a person) divided into
two different objects, they were then tracked to see if one of the objects remained motionless. In doing so, the
system was able to detect an "abandonment" event. When such an event occurred, an event timer began to determine
how long the luggage had been abandoned. If the luggage was left unattended for a preset amount of time, the
system tagged it for high resolution imaging.

Once an abandoned item was tagged for high resolution imaging, the system used a motorized pan-tilt mount to
point the high resolution camera and acquire a high resolution image of the item. This image was then sent to a
human supervisor for further investigation. The final security system allows a single person to monitor a vast array
of camera systems (spanning, for example, an entire airport) for abandoned luggage or any other pre-defined
suspicious event.

Keywords: Surveillance, Live-Video, Motion-Detection




1. Introduction

Video surveillance is extremely common in modern public areas, and as the number of cameras increases, so does the
number of video feeds to watch. This places an increasing burden on security personnel, as more and more must be
assigned to the task of watching the videos. Additionally, a person is only able to watch so many video feeds at the
same time with any sort of efficiency, and, as with any system involving human monitors, is subject to human error.
Take for example the London Subway Bombings of July 7th, 2005, in which the terrorists passed directly in front of
several security cameras (Figure 1). Even though the terrorists were on several watch lists, and security personnel
were in a heightened state of alert, they were allowed free access to the subways, and managed to leave their bags,
which contained bombs, throughout the London Underground.


Figure 1: Terrorists in London Subways on July 7th, 2005


In addition to the fallibility of human monitors, current surveillance networks suffer from another major drawback:
their low resolution. This severely limits the ability to identify threats in automated algorithms, such as the one
presented in this paper. The low resolution of current security systems also prevents the implementation of covert
automated biometric identification of people. As such, this research has developed a low-cost dual camera system,
which combines a low resolution video camera with a high resolution camera mounted on a moving
pan-tilt mount. This allows for analysis of the video to determine objects of interest and automatic investigation of
these objects using the high resolution camera.


2. Description of system components

The low resolution camera consists of a Logitech Quickcam Pro 5000 (Figure 2). This camera has proven ideal for
the purpose of surveillance because it combines good resolution (640 x 480 pixels), fast frame rates (30 frames per
second), and a wide field of view (82 degrees) in a small robust package. The Quickcam is stationary and surveys
the entire area of interest at once, continuously monitoring for movement. When movement in the field of view is
detected, the camera output is processed for object localization. The Quickcam interfaces directly with MATLAB
via a USB cable, which allows for quick processing of the video feed.


Figure 2: Quickcam Pro 5000


Figure 3: Canon Powershot S3IS


Figure  4:  Pan-Tilt  Mount 


The high resolution camera consists of a 6 mega-pixel (a mega-pixel is one million pixels) Canon Powershot S3IS
(Figure 3). This camera was chosen because of its good low light performance and motorized 12 times optical zoom.
The optical zoom is extremely fast, and can be controlled remotely, which allows for imaging of small objects. The
camera is mounted on a high speed pan-tilt mount for pointing, and comes with a Software Development Kit (SDK)
which allows for remote control of the camera over a USB cable.

The pan-tilt mount is a Directed Perception PTU-D46-17 which can pan a full 360 degrees and tilt 180 degrees, at
speeds of over 300 degrees a second. The mount allows for fast, stable, and accurate pointing of the high resolution
camera. The pan-tilt mount is controlled serially, and can be commanded directly from the GUI using the serial port
functionality built into MATLAB.


3. Methodology

The first part of reliable foreground segmentation is the creation of an accurate background image. This is easy in a
laboratory where one can simply clear the area of people and other foreground objects. Unfortunately, in real world
applications it can be difficult, if not impossible, to clear an area in order to get a background image. Additionally,
even if one clears an area and obtains an accurate background image when a camera is installed, any changes in the
background (such as a light burning out or furniture being moved) will wreak havoc on the system's foreground
segmentation. With this in mind, this research has developed a robust algorithm (Figure 5) which is able to take live
video and dynamically create a "composite" background image. Using this algorithm, backgrounds can be created
even with people present in the field of view of the camera. Additionally, because the algorithm works quickly and
can be used on live video, it can be used to periodically update the background image automatically to account for
changes.

The first step in the background creation is simple frame to frame differencing and thresholding to detect regions
that contain movement. Threshold values for creation of the binary image are calculated using Otsu's method [1].
This creates a binary difference image (or mask) where binary 1 represents motion. This difference mask is then
processed with binary morphology (dilation) to account for noise in the thresholding and reduce the possibility of
non-static areas passing through. This mask is then complemented so that static regions (those which will be called
the background) are logical 1's in the mask. This mask is then multiplied by the current frame to create a
motionless image frame, which contains only background areas. This motionless image is then stored. This process
is repeated until every pixel has at least 5 "motionless" values. In testing this was shown to take generally about 4
seconds (at 25 frames per second) for normal surveillance video clips. Once the array of motionless images is
created, the median value for each pixel is calculated, and used to form the final composite background image. An
example of the background creation is shown in Figure 6. The surveillance clip used comes from a London train
station, filmed for the "Performance Evaluation of Tracking and Surveillance Conference 2006" (PETS 2006) [2].
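
The background creation steps above can be sketched as follows. This pure-Python illustration works on tiny grayscale frames and is not the MATLAB implementation. It uses a fixed threshold where the text uses Otsu's method, and it stores motionless samples per pixel rather than whole masked frames; both are simplifications assumed for brevity, and the function names are hypothetical.

```python
from statistics import median


def dilate(mask):
    """Grow every set pixel into its 3x3 neighborhood."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for m in range(rows):
        for n in range(cols):
            if mask[m][n]:
                for i in range(max(0, m - 1), min(rows, m + 2)):
                    for j in range(max(0, n - 1), min(cols, n + 2)):
                        out[i][j] = 1
    return out


def static_mask(current, previous, threshold):
    """Return 1 where the scene did not move between two grayscale frames."""
    motion = [[1 if abs(c - p) > threshold else 0 for c, p in zip(cr, pr)]
              for cr, pr in zip(current, previous)]
    return [[1 - v for v in row] for row in dilate(motion)]


def composite_background(frames, threshold, samples_needed=5):
    """Median of the motionless samples per pixel, or None if the clip is too short."""
    rows, cols = len(frames[0]), len(frames[0][0])
    samples = [[[] for _ in range(cols)] for _ in range(rows)]
    for previous, current in zip(frames, frames[1:]):
        mask = static_mask(current, previous, threshold)
        for m in range(rows):
            for n in range(cols):
                if mask[m][n]:
                    samples[m][n].append(current[m][n])
        if all(len(s) >= samples_needed for row in samples for s in row):
            return [[median(s) for s in row] for row in samples]
    return None


def make_frame(t, rows=3, cols=12):
    """Toy clip: a bright one-pixel 'person' walks left to right over a flat scene."""
    frame = [[50] * cols for _ in range(rows)]
    if t < cols:
        frame[1][t] = 200
    return frame


frames = [make_frame(t) for t in range(20)]
background = composite_background(frames, threshold=30)
```

In the toy clip there is never a frame without the moving object during the first twelve frames, yet the recovered background is the flat scene, because only pixels outside the dilated motion mask contribute samples.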




Figure 5: Composite background creation


Figure 6: Example of composite background creation from live video


Once the composite background image is created, foreground segmentation is possible. This is accomplished
using the algorithm shown in Figure 7. The first step in the process is differencing and thresholding both the current
video frame and previous frame with the background image to create a binary mask for each, where binary 1 represents
regions not in the background. These masks are then filtered using a 3 by 3 majority filter and then morphologically
closed. These two different masks are then combined using a logical AND. This process of combining the current
frame with the previous frame was shown to be extremely effective in testing, and resulted in more accurate
segmentation results than could be attained by using only the current frame. An example of the foreground
segmentation process applied to the PETS 2006 data set is shown in Figure 8. Once the foreground mask is obtained,
the algorithm moves on to the next stage, object tracking.
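
The mask combination can be sketched as below. This Python illustration uses toy 8 x 8 grayscale frames and hypothetical function names. The majority rule (a pixel is set when at least 5 of the 9 pixels in its 3 x 3 window are set) is the usual definition and is assumed here, and the morphological closing step is omitted to keep the example short.

```python
def difference_mask(frame, background, threshold):
    """Return 1 where a grayscale frame differs from the background image."""
    return [[1 if abs(f - b) > threshold else 0 for f, b in zip(fr, br)]
            for fr, br in zip(frame, background)]


def majority_filter(mask):
    """Set a pixel when at least 5 of the 9 pixels in its 3x3 window are set."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for m in range(rows):
        for n in range(cols):
            votes = sum(mask[i][j]
                        for i in range(max(0, m - 1), min(rows, m + 2))
                        for j in range(max(0, n - 1), min(cols, n + 2)))
            out[m][n] = 1 if votes >= 5 else 0
    return out


def foreground_mask(current, previous, background, threshold):
    """AND together the filtered masks of the current and previous frames."""
    cur = majority_filter(difference_mask(current, background, threshold))
    prev = majority_filter(difference_mask(previous, background, threshold))
    return [[a & b for a, b in zip(cr, pr)] for cr, pr in zip(cur, prev)]


def paint(frame, top, left, value, size=3):
    """Draw a size x size block into a frame (helper for the toy scene)."""
    for m in range(top, top + size):
        for n in range(left, left + size):
            frame[m][n] = value


background = [[50] * 8 for _ in range(8)]
previous = [row[:] for row in background]
current = [row[:] for row in background]
paint(previous, 1, 1, 200)   # real object, present in both frames
paint(current, 1, 1, 200)
paint(current, 5, 5, 200)    # one-frame flicker, present only in the current frame

current_only = majority_filter(difference_mask(current, background, 30))
mask = foreground_mask(current, previous, background, 30)
```

The flicker survives the majority filter on the current frame alone but is removed by the AND with the previous frame, while the object that is present in both frames is kept.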




Figure 7: Foreground segmentation and object tracking algorithm

The final stage of the algorithm is the actual tracking of foreground objects. Foreground objects consist of regions
of contiguous binary 1s in the foreground mask. The size and center of each of these regions is calculated and
compared to the objects present in the previous iteration. If an object is "new", it is stored in the object structure.
In testing, new objects were either people entering the scene, a group of people separating, or a person and luggage
separating. If an object was present in the previous frame, its statistics are updated and its motion vector (direction
and speed) is calculated. This motion vector is used to determine if an object is idle; if the object remains idle for
longer than a threshold time τ, it is queued for imaging using the high resolution camera. In testing, it was shown
that people standing still were rarely detected as "idle" because even if a person is not moving, they are seldom, if
ever, perfectly still. Abandoned luggage, on the other hand, was detected with a great degree of reliability because,
unlike people, it is perfectly motionless for significant amounts of time. When an object is queued for high
resolution imaging, its size is used to determine zoom and its coordinates are passed to the pan/tilt mount to point
the camera.
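The idle test described above can be illustrated with a small Python sketch. Everything named here is hypothetical: the class `TrackedObject`, the function `update_object`, the speed tolerance `speed_eps`, and the use of a frame count (`idle_limit`) in place of the threshold time are assumptions made for illustration, not details taken from the report.

```python
import math


class TrackedObject:
    """Minimal per-object record: centre, size, and idle bookkeeping."""

    def __init__(self, center, size):
        self.center = center
        self.size = size
        self.idle_frames = 0
        self.queued = False


def update_object(obj, new_center, new_size, speed_eps, idle_limit):
    """Update an object seen again in this frame and return (speed, direction).

    Speed is in pixels per frame and direction is in radians. The object is
    queued for imaging once it has been idle for more than idle_limit frames.
    """
    dx = new_center[0] - obj.center[0]
    dy = new_center[1] - obj.center[1]
    speed = math.hypot(dx, dy)
    direction = math.atan2(dy, dx)
    obj.center, obj.size = new_center, new_size
    # Any movement above the tolerance resets the idle counter.
    obj.idle_frames = obj.idle_frames + 1 if speed <= speed_eps else 0
    if obj.idle_frames > idle_limit:
        obj.queued = True
    return speed, direction
```

Resetting the counter on any movement is what lets a standing person, who shifts slightly from frame to frame, avoid the idle flag, while a motionless bag accumulates idle frames until it is queued.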


Figure 8: Example of foreground segmentation in London train station


4. Data




The first testing conducted was of the background creation algorithm. Five test scenario movie clips were used, three
from the PETS 2006 database and two from video of public areas recorded at the United States Naval Academy.

Each of these test scenarios was processed 10 times, with the first test conducted starting at the beginning of the clip,
the second at 10 seconds, the third at 20 seconds, and so on. The first test was used to determine the time it took to
calculate the background pixel values from the motionless images using either an averaging or a median algorithm. As can be
seen in Figure 9, the median algorithm took longer in all five scenarios.

The next test focused on the accuracy of the background images created in the previous test. Because this research
used pre-recorded clips, it was possible to obtain 'perfect' background images; each clip contained at least one
frame where there was no foreground present in the field of view. By using these as references, it was possible to
test the accuracy of each background image created by taking the absolute error of each pixel in the composite
backgrounds and averaging them. This gives Figure 10, which shows the average error for the median and
averaging algorithms in the five scenarios. It is readily apparent that the median algorithm is significantly more
accurate in all cases.
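A small Python sketch shows why a median background can be more accurate than an averaged one under this error measure. The helper names and the toy two-pixel "frames" are illustrative assumptions; the numbers are not the report's data.

```python
from statistics import mean, median


def composite_background(frames, combine):
    """Combine a stack of frames pixel by pixel using mean or median."""
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[combine([f[r][c] for f in frames]) for c in range(cols)]
            for r in range(rows)]


def mean_absolute_error(estimate, reference):
    """Average of |estimate - reference| over all pixels."""
    errors = [abs(e - ref)
              for erow, rrow in zip(estimate, reference)
              for e, ref in zip(erow, rrow)]
    return sum(errors) / len(errors)


# Toy data: a true background of 100, with a bright object (200)
# covering the first pixel in one of five frames.
reference = [[100, 100]]
frames = [[[100, 100]], [[100, 100]], [[200, 100]], [[100, 100]], [[100, 100]]]

median_error = mean_absolute_error(composite_background(frames, median), reference)
mean_error = mean_absolute_error(composite_background(frames, mean), reference)
```

In this toy case the median ignores the single outlying frame, while the average is pulled toward it, so `median_error` is smaller than `mean_error`.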

The final stage of testing involved using the system in staged scenarios. The entire system was set up, and subjects
were instructed to enact different scenarios. The tests ranged from a crowd of people walking back and forth with
no baggage abandoned to having everyone in the crowd abandon luggage. As can be seen in Table 1, the system
correctly detected abandoned luggage 100% of the time for all scenarios with fewer than 3 pieces of abandoned
luggage (a drop consisted of a person abandoning an item they entered the scene with). Once it reached three drops,
the system began to get saturated and failed to detect some abandoned items. False detections, which were defined as the
system queuing a foreground object for photography when it was not abandoned luggage, were rare, especially in
tests involving small numbers of people.

Table 1: Results of testing on staged scenarios

Scenario                          Correct Detection   Frames to Acquire Object as Idle   False Detections
Nothing Idle, People Wandering    N/A                 N/A                                1
1 Person, 1 Drop                  100%                12                                 0
Multiple People, 1 Drop           100%                10                                 0
2 People, 2 Drops                 100%                                                   0
    Bag 1                         TRUE                20
    Bag 2                         TRUE                16
Multiple People, 2 Drops          100%                                                   1
    Bag 1                         TRUE                24
    Bag 2                         TRUE                20
3 People, 3 Drops                 100%                                                   0
    Bag 1                         TRUE                20
    Bag 2                         TRUE                15
    Bag 3                         TRUE                14
Multiple People, 3 Drops          66.67%                                                 2
    Bag 1                         TRUE                26
    Bag 2                         TRUE                22
    Bag 3                         FALSE               N/A
Multiple People, 8 Drops          75.00%                                                 1
    Bag 1                         TRUE                28
    Bag 2                         TRUE                22
    Bag 3                         TRUE                24
    Bag 4                         TRUE                24
    Bag 5                         FALSE               N/A
    Bag 6                         TRUE                27
    Bag 7                         TRUE                21
    Bag 8                         FALSE               N/A

Figure 9: Time to create background images


Figure 10: Error in background creation algorithms




5. Conclusion

One of the primary difficulties in working in automated security in the real world is that variations in
the scenery can cause significant errors. Creating an algorithm that is able to adapt itself to any environment
presented to it is one of the biggest hurdles to overcome in order to create a reliable autonomous security system.
This research successfully accomplishes this with the background creation algorithm. The system's ability to rapidly
create an accurate representation of an 'empty' scene allows it to be deployed at any location and used immediately.
Additionally, the use of temporal differencing in the foreground segmentation algorithm proved extremely
successful for creating an accurate foreground mask.

The high resolution camera system proved to be extremely useful in testing, allowing distant objects barely
visible in the low resolution video to be photographed in great detail. While the objects could not be identified in the
video feed, the high resolution photos allowed the human supervisor to positively identify them as abandoned
luggage. A surveillance system could easily be set up using any number of the high/low resolution camera systems
and monitored by a single supervisor, since they would only need to occasionally examine the high resolution
photographs which the systems flagged as security risks.

The problems encountered in testing were primarily due to poor lighting conditions in the test area. Flickering and
dim lights caused some error in foreground segmentation and also occasionally prevented the high resolution camera
from focusing properly. The problem of flickering lights could be solved in future work by adjusting the
segmentation to account for them (for example, adjusting the algorithm to ignore periodic noise). The problems with
the high resolution imaging could be solved by using a true single-lens reflex (SLR) camera (none were readily
available for use in this research), or a camera with better low light imaging capabilities.

6. Acknowledgements

The author wishes to express his appreciation to:

Dr. Robert Ives, Electrical Engineering Department, USNA - Primary Project Adviser

Dr. Randy Broussard, Weapons and Systems Engineering Department, USNA - Secondary Project Adviser

Mr. Jeffery Dunn, National Security Agency - External Collaborator

Ms. Daphi Jobe, Electrical Engineering Department, USNA - Electronics Technician

Mr. Jerry Ballman, Electrical Engineering Department, USNA - Laboratory Technician

Mr. Michael Wilson, Electrical Engineering Department, USNA - Laboratory Technician

7. References

[1] Otsu, N., "A Threshold Selection Method from Gray-Level Histograms," IEEE Transactions on Systems, Man,
and Cybernetics, Vol. 9, No. 1, 1979, pp. 62-66.

[2] Performance Evaluation of Tracking and Surveillance Conference 2006, "PETS 2006 Benchmark Data",
http://www.cvg.rdg.ac.uk/PETS2006/data.html


8. Works Consulted

1. N. Friedman and S. Russell, "Image segmentation in video sequences: A probabilistic approach," Proceedings of
the Thirteenth Conference on Uncertainty in Artificial Intelligence (UAI-97), pp. 175-181, Morgan
Kaufmann Publishers, Inc., (San Francisco, CA), 1997.

2. Q. Zhou and J. Aggarwal, "Tracking and classifying moving objects from videos," Proceedings of IEEE
Workshop on Performance Evaluation of Tracking and Surveillance, 2001.

3. L. Li, W. Huang, I. Yu-Hua Gu, and Q. Tian, "Statistical Modeling of Complex Backgrounds for Foreground
Object Detection," IEEE Transactions on Image Processing, vol. 13, no. 11, Nov., pp. 1459-1472.




Appendix  E:  Original  Project  Report 

Due  to  unforeseen  human  subject  testing  issues,  the  nature  of  this  Trident  Project  was 
modified  in  December  2006.  While  the  hardware  and  its  control  functions  for  the  original  project 
were  able  to  be  used  in  the  new  project,  the  algorithms  were  not  applicable  to  the  new  topic.  As 
such,  the  original  project  report  has  been  included  here  to  document  the  accomplishments 
achieved  during  the  first  semester  of  the  Trident  project. 


Original Project Abstract




After the London subway bombings of 7 July 2005, British authorities were able
to use recordings from the extensive surveillance system installed in the subway system
to identify potential suspects, which proved a great boon to their investigation, and led to
arrests 5 days after the bombings. Unfortunately, current surveillance camera networks
only provide fuzzy low resolution imagery, and have no ability to conduct automated
screening using facial recognition. Because of this, London's huge network of security
cameras could not prevent the terrorist attacks; they could only help in the investigation.

The primary goal of this project is to develop a complete surveillance system that
integrates a low cost, wide angle, low resolution camera with a high resolution digital
camera in a complete self-contained package. The system uses a high-speed motorized
pan/tilt mount to allow the high resolution camera to focus in on any point in the low
resolution camera's field of view. The low resolution camera is connected to a computer
which runs a Windows program that uses neural networks to locate individual faces in the
field of view. These coordinates are then used to move the pan/tilt mount, so that the high
resolution camera can take a close-up image of the subject's face.

Testing involved the set up of one camera unit in a building of the Naval
Academy, with volunteer Midshipmen used as subjects. The system ran continuously
with data processing being done on a laptop. Individuals who volunteered were asked to
walk down the hallway in varying numbers, from one to six people at a time. This
allowed for controlled testing to determine the system's reliability and response speed.


Table  of  Contents: 

1. Biometrics and Security

2. Project Description

3. Interfacing of Components

4. Foreground Extraction from Live Video

5. Color Segmentation for Skin Detection

6. Neural Network for Window Classification

7. Overall Operation

8. System Testing

9. Conclusions

10. Spring Semester Deliverables

11. Works Cited

12. Works Consulted




Table  of  Figures 


Figure 1: Terrorists in London Subway on July 7, 2005

Figure 2: Security Camera System

Figure 3: Quickcam Pro 5000

Figure 4: Canon Powershot S3IS

Figure 5: Directed Perception Pan/Tilt Mount

Figure 6: The System

Figure 7: Motion Segmentation

Figure 8: Skin Segmentation Data Structure

Figure 9: Skin Segmentation

Figure 10: A Neuron

Figure 11: A Feed Forward Neural Network

Figure 12: Resizing and Transformation

Figure 13: Overall Outline of System Operation




1. Biometrics and Security

Biometrics is the general term for a large field of study which consists of using
digital signal processing to automatically recognize individuals based on intrinsic human
traits. These traits can be anything which is unique about individuals; systems have been
developed to recognize faces, irises, retinas, fingerprints, voice, and even gait. On the
surface the problem may seem elementary; human beings instinctively recognize
hundreds of people with few, if any, errors on a daily basis. The simplicity with which we
recognize people disguises the difficulty of the problem. The study of biometrics over the
past 15 years has shown that the problem is far more complex than it would seem at first.
This fact does not surprise biologists, since over 50% of the brain is used for visual
processing. In fact, contrary to what may seem logical, we use far more brain power in
recognizing people than we do in purely intellectual pursuits such as mathematics [2].

The first issue to be dealt with is how to obtain standardized data. We, as humans,
accomplish this instinctively; we simply turn our heads and look at the individual we
wish to identify. Clearly, this task is far more complicated for a stationary camera to
accomplish. Early biometric systems avoided this problem by only functioning
effectively if the subject to be identified looked directly into a camera. Bypassing the
problem by putting a constraint on where a subject must look does not provide a viable
solution. There is a need for systems which are able to identify individuals without
interacting with them. This leads to a more complex problem, which involves real time
pattern recognition to locate individuals in a scene.




Secondly, the amount of data collected that must be processed in real time must
be taken into consideration. Recorded signals of human traits, be they one-dimensional
signals such as voice, or three-dimensional images of a person's face, contain far too
much information to be processed directly. These images or recordings must first be
analyzed using automated algorithms to reduce them into smaller metrics to be used for
computational comparisons.

Once the data metrics are obtained, there are two major uses for them:
verification/identification and comparison. The first consists of the verification or
identification of an individual's identity. This is the verification that a person is who they
claim to be (one-to-one comparison), or their identity may be determined automatically
from a list of known individuals (one-to-many comparison). The
verification/identification process is primarily used to control access to secure facilities or
networks. The other use is comparing individuals passing through a certain area such as
an airport terminal to a 'watch-list' that may consist of a list of terrorists or wanted
criminals (a many-to-many comparison). This application is far more complex because it
requires collecting data from many subjects simultaneously and rapidly checking them
against a database that could potentially be very large. Additionally, such a system is
only useful if subjects are unaware that they are being observed, so it must be able to scan
large areas that could potentially contain multiple subjects at the same time. By taking all
of these individual elements into consideration, this research has developed a system that
requires no user interaction and could be used in a 'watch-list' application to monitor for
suspects.




2.  Project  Description 

Facial recognition is currently the most promising covert biometric system,
primarily because the face is the most distinct human trait that can be seen from a
distance. There are many available schemes for facial identification currently being used,
but generally speaking, the better the image quality (resolution, lighting, and orientation),
the better the recognition rate. Currently several cities and airports have implemented
security camera networks which seek to use stationary cameras to obtain images of
individuals' faces in order to screen them against a watch-list.


Figure 1: Terrorists in London Subway on July 7, 2005


These systems have had a few successes, such as the capturing of the July 7, 2005
London Subway bombers on film, but for the most part have been ineffective at
identifying individuals. For instance, the London Borough of Newham has had a facial
recognition system built into its borough-wide Closed Circuit Television (CCTV)
system for several years, and has yet to identify a single suspect on the watch-list, even
with several of the known suspects actually residing in the borough.

The primary problem with these CCTV systems is the quality of the images they
take; security cameras are generally of low resolution, which greatly restricts the
accuracy of a facial recognition algorithm. This project seeks to resolve this problem by
developing an advanced security camera unit which actually contains two cameras: a low
resolution web camera, and a high resolution still-frame digital camera. The block
diagram in Figure 2 shows the system.


Figure 2: Security Camera System


The low resolution camera consists of a Logitech Quickcam Pro 5000 (Figure 3).
This camera has proven ideal for this purpose because it combines good resolution (640 x
480 pixels), fast frame rates (30 frames per second), and a wide field of view (82 degrees)
in a small robust package. The Quickcam is stationary, and surveys the entire area of
interest at once, continuously monitoring for movement. When movement in the field of
view is detected, the camera output is processed for face detection and localization.




Figure 3: Quickcam Pro 5000

The high resolution camera consists of a 6 mega-pixel (a mega-pixel is one
million pixels) Canon Powershot S3IS (Figure 4). The camera is mounted on a high
speed Directed Perception Pan/Tilt mount which can pan (left/right) a full 360 degrees
and tilt (up/down) 180 degrees. The camera takes still-frame images which could be used
for facial recognition processing. Producing these high resolution images is the primary
purpose of the system.


Figure 4: Canon Powershot S3IS

With the detail provided in a 6 mega-pixel image, far more advanced facial
recognition processing is possible than with low resolution video. The most promising of
these higher resolution techniques is Skin Texture Analysis (STA) developed by the
Identix Corporation, which realizes much higher recognition rates than was possible with
older techniques, such as Eigenfaces [3].

3. Interfacing of Components

The first major hurdle in developing a system that combines various off-the-shelf
components into a single working unit is interfacing. Both the high resolution camera and
the mount required separate controls to be developed in order to meet their desired
functionality. This was accomplished in MATLAB.

Control of the Pan/Tilt mount (Figure 5) was accomplished using the computer's
serial port. The serial port is used to send commands to the Pan/Tilt mount's controller,
and allows control of both the movement of the mount and all of its settings, such as slew
rate and acceleration.


Figure 5: Directed Perception Pan/Tilt Mount


Controlling the Canon Powershot proved to be much more difficult. The primary
reason for this is that the Canon Software Development Kit (SDK), which provides a
basic framework for communicating with their cameras, was written in the C
programming language. In order to use the camera, all of the control functions had to be
ported into MATLAB using MEX files, which can interface C code with MATLAB.
MEX files are a specific type of MATLAB file which are written in C, but can compile
and run in MATLAB. MEX files for initializing the camera, changing camera settings,
and controlling the shutter were created. But the actual transfer of images from the high
resolution camera to the computer over the USB (Universal Serial Bus) cable proved
problematic, as the Canon framework for doing this could not be ported into MATLAB.
As such, new functions were written in C which transferred the image data from the
camera over the USB cable in individual packets. Packet sizes in initial testing were
limited to only 1024 bytes, which resulted in very poor transfer speeds for images
(upwards of 5 seconds). After several adjustments and revisions to the transfer protocol,
128 kilobyte packets were used successfully, which increased transfer speeds
substantially (approximately one second for a one megabyte image). The complete
assembled system can be seen in Figure 6.
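The effect of the packet size can be made concrete with a short calculation. The sketch below only counts packets; it assumes a one megabyte image of 1024 x 1024 bytes and a kilobyte of 1024 bytes, and the function name `packets_needed` is hypothetical. It says nothing about the per-packet overhead of the actual transfer protocol.

```python
def packets_needed(image_bytes, packet_bytes):
    """Number of packets required to move an image (ceiling division)."""
    return -(-image_bytes // packet_bytes)


ONE_MEGABYTE = 1024 * 1024                                  # assumed image size in bytes

small_packets = packets_needed(ONE_MEGABYTE, 1024)          # 1024-byte packets
large_packets = packets_needed(ONE_MEGABYTE, 128 * 1024)    # 128 kilobyte packets
```

Under these assumptions the larger packets cut the number of transfers per image by a factor of 128, which is consistent with the large speed-up reported.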


Figure  6:  The  System 




4. Foreground Extraction from Live Video

The first step in detecting motion in video is extracting regions of interest from
the background. There are two approaches to this: motion segmentation and background
subtraction. Motion segmentation is implemented by finding the absolute difference
between the current frame and a previous frame and then thresholding the result. The
equation for doing this is shown in equation (5-1), where pixel $X_k(m, n, c)$ in frame $k$ is
subtracted from the same pixel in the frame $l$ frames ago, and thresholded using a threshold value $T$. Here, $m$
and $n$ represent the row and column numbers, and $c$ represents the three color planes of
red, green, and blue.

$$\left| X_k(m, n, c) - X_{k-l}(m, n, c) \right| > T \qquad (5\text{-}1)$$
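As a rough illustration of equation (5-1), the sketch below computes a motion mask from two RGB frames stored as nested lists of (red, green, blue) tuples. The function name `motion_mask` is hypothetical, and marking a pixel as moving when any one color plane exceeds the threshold is an assumption; the text does not say how the three planes are combined.

```python
def motion_mask(frame_k, frame_k_minus_l, threshold):
    """Return 1 where any colour plane differs by more than the threshold."""
    return [[1 if any(abs(a - b) > threshold for a, b in zip(p, q)) else 0
             for p, q in zip(row_k, row_old)]
            for row_k, row_old in zip(frame_k, frame_k_minus_l)]
```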

An example of motion detection from the system is shown in Figure 7. As can be
seen, there is a substantial border around the people; this is due to the morphological
techniques used on the areas where motion was detected. The morphology consists of
erosion to eliminate noise, and then dilation to expand the areas of motion. This is done
because it was determined in testing that it is better to get extra area surrounding detected
motion in this step, rather than risk losing areas that contained people's faces due to
processing tolerances. Additionally, it can be seen that there are some areas where motion
is detected that are in fact not moving. This is, for the most part, due to variations in
lighting and video noise. While motion segmentation is computationally efficient and
very adaptable to changing backgrounds, testing has shown that it has limitations.
Variations in lighting, video noise, and stationary foreground objects can cause it
significant problems.




Figure 7: Motion Segmentation


Background subtraction, while more computationally intensive, can be more
reliable. This technique requires building a composite background, which can be done in
many different ways, each of which has its own advantages and disadvantages. The most
common methods are Kalman filtering [4], mixture of Gaussians [5], and median filtering
[6]. For the purposes of this research, none of these techniques were computationally
efficient enough to be applied, so a new technique is being developed that builds on these
techniques. The base premise of the approach is that pixel values for background will
reoccur quite often over time, with slight variations due to lighting. As such, a composite
background is built where weights are assigned for the values of each pixel, depending on
frequency of occurrence over time. These weights are adjusted in each frame according to
equation (5-2), where W_i specifies the weight for intensity value i, T the learning rate,
and K is 1 for the new value and 0 for others.

As new values appear, they are added to the list of weights, and so over time, the most
commonly occurring values for each pixel will be given large weights. If a foreground
object moves into the scene, its weight value will be extremely low, and thus it can be
segmented. This technique is a work in progress, and has yet to be fully implemented in
the system.
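One way such a per-pixel weighting scheme could work is sketched below in Python. The update rule used here, w = (1 - rate) * w + rate * K, is the common exponential form and is an assumption, not the report's equation (5-2). The function names, the `min_weight` cutoff, and the exact matching of intensity values (rather than tolerating small lighting variations) are likewise illustrative simplifications.

```python
def update_weights(weights, observed, rate):
    """Update one pixel's weights after observing an intensity value.

    weights maps intensity value -> weight. K is 1 for the observed value
    and 0 for every other value already in the list.
    """
    if observed not in weights:
        weights[observed] = 0.0          # new values are added to the list
    for value in weights:
        k = 1.0 if value == observed else 0.0
        weights[value] = (1.0 - rate) * weights[value] + rate * k
    return weights


def is_foreground(weights, observed, min_weight):
    """A value with a low (or missing) weight is treated as foreground."""
    return weights.get(observed, 0.0) < min_weight
```

With this rule, a value that recurs frame after frame climbs toward a weight of 1, while a value that appears only briefly stays near the learning rate and is flagged as foreground.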

5. Color Segmentation for Skin Detection

Once foreground regions are segmented, these regions are put through further
processing to determine where skin is located in the scene. Skin detection is based on the
principle that in certain color planes, skin values occupy a very narrow range of values,
even when different races are considered. After extensive testing, the LUV color space
was chosen because it allowed for the most accurate segmentation, while still using
integer values (which allows for a fast processing technique that will be shown later). A
pixel in the LUV color space contains three values: one for luminance (brightness), and
two for chrominance (color). Experimental testing showed that a pixel at location (x, y)
could be reliably called skin tone if it satisfied equation (6-1).

$$\big(140 \le L(x, y) \le 230\big) \cap \big(100 \le U(x, y) \le 118\big) \cap \big(140 \le V(x, y) \le 158\big) \qquad (6\text{-}1)$$

In order to implement the skin segmentation in a manner as efficient as possible,
the number of comparisons to determine if a pixel is skin must be kept as low as possible.
This is necessary because every video frame contains 640x480 pixels, or 307,200 pixels.
If equation (6-1) were implemented in a brute force approach, each pixel would require 8
comparisons, or 2,457,600 comparisons per frame. This proved to be too slow, since 30
frames per second were being processed, so a new, faster approach was implemented. In
this approach, the ranges of values that are defined as skin are put in three data structures, one
for each portion of the color space. A graphical representation of these structures is seen
in Figure 8. The structures contain a logical 1 or 0 (corresponding to whether or not it
falls in the range for skin) for every possible L, U, and V value (integers with value 0 to
255). These structures allow for much faster skin detection, as it only requires three
quick lookup operations to determine if a pixel lies within the skin region. This reduces
the maximum number of computations per frame from 2,457,600 to 921,600. In trials,
this amounted to a reduction of about 0.05 seconds per frame (a significant amount of
processing time saved for real time video).
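The lookup-table idea can be sketched in Python as follows. The table and function names are hypothetical, and treating the range limits as inclusive is an assumption made for this illustration; the limits themselves are the ones quoted in the skin-tone test above.

```python
def build_table(low, high):
    """256-entry table: True for integer values inside [low, high]."""
    return [low <= value <= high for value in range(256)]


# One table per LUV component, built once before any video is processed.
L_TABLE = build_table(140, 230)
U_TABLE = build_table(100, 118)
V_TABLE = build_table(140, 158)


def is_skin(l, u, v):
    """Classify a pixel with three table lookups instead of range tests."""
    return L_TABLE[l] and U_TABLE[u] and V_TABLE[v]


PIXELS_PER_FRAME = 640 * 480
BRUTE_FORCE_OPS = PIXELS_PER_FRAME * 8   # comparisons per frame, as stated
LOOKUP_OPS = PIXELS_PER_FRAME * 3        # one lookup per component
```

The tables trade a small, fixed amount of memory (768 entries) for fewer operations per pixel, which is why the approach depends on the color values being integers from 0 to 255.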


[Figure: three lookup tables, one each for L, U, and V, indexed 0 to 255, holding a 1 for values inside the skin range and a 0 otherwise]

Figure 8: Skin Segmentation Data Structure

Figure 9 shows the motion detected image from the previous section on the left,
and on the right a binary image which shows in white where skin has been detected. As
can be seen, there are some errors in the system, as the colors of certain background areas
fell within the range of values specified as skin tones. This is not a significant problem,
and will be solved once the new background segmentation code is implemented. Aside
from that one error, Figure 9 shows that at this point the video frame has been reduced to
only two types of objects: hands and faces. This is where the next step in the algorithm,
the neural network classifier, comes into play.


Figure 9: Skin Segmentation

6. Neural Network for Window Classification

At this point the video frame has been reduced to a few windows which contain
either faces or other objects (such as hands). Once these windows are extracted, they
must be classified as either faces or non-faces. To do this a multilayer perceptron neural
network classifier is employed. Neural networks, named after their resemblance in
functionality to the human nervous system, consist of large numbers of simple processing
elements called neurons, which exhibit complex global behavior based on their
connections and individual neuron parameters. Neural networks are used primarily to
solve problems for which no mathematically optimal solution exists, such as determining
whether or not an image contains a face.


Figure 10: A Neuron




Each individual neuron in a neural network is simple in nature; it consists of a
summation and a thresholding function. An example of a neuron is shown in Figure 10.
Each input Xj is multiplied by a weight Wj and these products are summed in the neuron.
This value is then thresholded by a sigmoid function, and the result, which saturates
toward either one or zero, is sent to the neuron's output, which may feed the next layer
of nodes. The type of neural network employed in this research is called a multilayer
perceptron feed-forward network. This type of neural network is characterized by two
principal elements: all of the inputs are connected to every hidden node, and information
only goes forward (i.e., there is no feedback). This type of network was chosen because of
its speed; since there is no feedback, the nodes in each layer can be computed
simultaneously, allowing very fast classification. An example of a feed-forward network
similar to the one used is shown in Figure 11.
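
The neuron described above can be sketched in a few lines of Python. This is only an illustration of the summation and sigmoid steps, not the MATLAB code used in this research; the function names, the optional bias term, and the example values are assumptions made for the sketch.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the range (0, 1)."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def neuron_output(inputs, weights, bias=0.0):
    """Multiply each input by its weight, sum the products, apply the sigmoid.

    The bias term is an assumption of this sketch; the text does not mention one.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    total = bias + sum(x * w for x, w in zip(inputs, weights))
    return sigmoid(total)


# Illustrative values only; they are not taken from the report.
example = neuron_output([1.0, 0.5, -0.25], [2.0, -1.0, 4.0])
```

With all-zero inputs the weighted sum is zero and the neuron returns exactly 0.5, the midpoint of the sigmoid; large positive or negative sums push the output toward one or zero.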


Figure 11: A Feed-Forward Neural Network


Figure 11 shows a neural network that has n input values presented at a time to
each of four neurons in a "hidden" layer. The output of these hidden neurons is passed to
each of two neurons in the output layer.
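
A forward pass through a network of this shape might look like the following sketch. It assumes sigmoid neurons without bias terms and uses random weights purely as placeholders; the input length of 96 is an illustrative choice, and none of the names come from the original code.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def layer_forward(inputs, weight_rows):
    """Each row of weight_rows holds one neuron's weights for every input."""
    return [sigmoid(sum(x * w for x, w in zip(inputs, row))) for row in weight_rows]


def feed_forward(inputs, hidden_weights, output_weights):
    """Inputs feed every hidden neuron; hidden outputs feed every output neuron."""
    hidden = layer_forward(inputs, hidden_weights)
    return layer_forward(hidden, output_weights)


# Placeholder network: n inputs, four hidden neurons, two output neurons.
random.seed(0)
n_inputs, n_hidden, n_outputs = 96, 4, 2
hidden_weights = [[random.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
output_weights = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_outputs)]
window_vector = [random.random() for _ in range(n_inputs)]
outputs = feed_forward(window_vector, hidden_weights, output_weights)
```

Because information only moves forward, each layer is a single pass over its weight rows, which is what makes classification fast.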

As discussed earlier, each neuron or "node" has weights which determine the
output for any given set of inputs. In order to determine the values of the weights, and
thus get the correct output, the network must be trained. This is accomplished by building
a set of training data, which contains thousands of examples of faces and non-faces. This
training data set was built by recording video of people walking down a hallway, and
then using the previous two steps, motion segmentation and flesh detection, to extract
windows. These windows were then all manually classified by the author as faces or
non-faces. In order for the windows to be processed by the network code, they must be
changed from image matrices to one-dimensional vectors as illustrated in Figure 12.

Once each window is classified and transformed into a vector, a matrix is created
in which each row represents a window's vector, with the first column containing
the correct classification (a 1 for face, or a 2 for non-face). This matrix is then used
iteratively to find the best weights for the hidden nodes so that the neural network obtains
the best classification performance possible (since the correct classification is known
during training). One of the key variables in training is the number of hidden nodes;
various numbers of nodes were tested in order to determine the number of hidden nodes
that gave the best accuracy. Six hidden nodes worked the best, giving fast computation
times and over 96% accuracy on the training set.
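
The layout of the training matrix can be sketched as follows. The helper names are hypothetical, and the row-by-row flattening order is an assumption (MATLAB, for instance, would flatten column by column); only the label convention of 1 for face and 2 for non-face comes from the text.

```python
FACE, NON_FACE = 1, 2  # class labels as described in the text


def flatten_window(window):
    """Turn a 2-D image matrix (a list of rows) into a one-dimensional vector."""
    # Row-major order is an assumption of this sketch.
    return [pixel for row in window for pixel in row]


def build_training_matrix(labelled_windows):
    """Return one row per window: [label, pixel_0, pixel_1, ...]."""
    matrix = []
    for label, window in labelled_windows:
        if label not in (FACE, NON_FACE):
            raise ValueError("label must be 1 (face) or 2 (non-face)")
        matrix.append([label] + flatten_window(window))
    return matrix


# Tiny 2x2 windows with made-up pixel values, for illustration only.
windows = [
    (FACE, [[10, 20], [30, 40]]),
    (NON_FACE, [[5, 5], [5, 5]]),
]
training = build_training_matrix(windows)
```

Each row of `training` then starts with the manually assigned class, followed by the window's pixels.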

One of the other key variables is the number of input values passed into the neural
network for each classification (that is, the number of elements in each input vector). For
this research, this amounted to the number of pixels used for a face window. Various window
sizes were tested, ranging from as small as 6x6 up to 30x30. It was found that increasing
resolution past 12x8 actually resulted in a decrease in performance (both in efficiency
and accuracy). While at first this may seem counterintuitive, it is a common occurrence
in neural networks. It is primarily due to the fact that faces only have so many
distinguishing features common to all of them: a round shape, light areas on the nose and
forehead, and dark areas around the eyes and mouth. Once the resolution is increased
beyond a certain point, the extra information provided by the extra pixels is spurious at
best, and often detrimental to performance. This entire process is illustrated in Figure 12.
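
Reducing a window to a fixed low resolution can be sketched with nearest-neighbor sampling. The resampling method actually used is not stated in the text, so this is only one possible approach, and reading 12x8 as 12 rows by 8 columns is likewise an assumption.

```python
def resize_nearest(image, out_rows=12, out_cols=8):
    """Shrink (or stretch) a 2-D image to out_rows x out_cols by nearest-neighbor sampling."""
    in_rows, in_cols = len(image), len(image[0])
    return [
        [image[(r * in_rows) // out_rows][(c * in_cols) // out_cols] for c in range(out_cols)]
        for r in range(out_rows)
    ]


# Synthetic 24x16 window whose pixel value encodes its own position.
window = [[r * 100 + c for c in range(16)] for r in range(24)]
small = resize_nearest(window)

# Flattened, the reduced window gives the fixed-length network input.
input_vector = [pixel for row in small for pixel in row]
```

A 12x8 window yields 96 input values regardless of how large the original detection window was.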


Figure 12: Resizing and Transformation (reduce image to 12x8, then transform image matrix to vector)


7. Overall Operation

Once a window is classified as a face, pertinent information (centroid location,
size, and speed) about the window is stored in a data structure. Only location and size are
stored in the initial detection, with speed being stored after the same face is detected in
four successive frames. Speed is determined by a weighted average of differences
between the centroids from successive frames. The equation for doing this is shown in
Equation 8-1, where $n = 1$ is the most recent frame, $x_n$ is the centroid in frame $n$,
$N$ is the number of frame-to-frame differences averaged, and $K_n$ are the weighted
average constants.

$$\text{speed} = \sum_{n=1}^{N} K_n \left( \lvert x_n - x_{n+1} \rvert \right) \qquad (8\text{-}1)$$

$$K = [K_1, \ldots, K_N]$$
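
The weighted average of centroid differences described above can be sketched as follows. The weight values here are hypothetical placeholders chosen to sum to one, and using the Euclidean distance between two-dimensional centroids is an assumption of this sketch.

```python
import math


def weighted_speed(centroids, weights):
    """Weighted sum of distances between centroids of successive frames.

    centroids[0] is the most recent frame; each centroid is an (x, y) pair.
    The result is in pixels per frame.
    """
    if len(centroids) != len(weights) + 1:
        raise ValueError("need exactly one more centroid than weights")
    distances = [
        math.hypot(newer[0] - older[0], newer[1] - older[1])
        for newer, older in zip(centroids, centroids[1:])
    ]
    return sum(k * d for k, d in zip(weights, distances))


# Hypothetical weights that favor the most recent motion.
K = [0.5, 0.3, 0.2]
# Four successive detections of the same face, most recent first.
track = [(10.0, 0.0), (8.0, 0.0), (5.0, 0.0), (1.0, 0.0)]
speed = weighted_speed(track, K)  # 0.5*2 + 0.3*3 + 0.2*4
```

Four detections give three frame-to-frame differences, so a single badly centered window shifts the estimate by only its weighted share.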

The weighted average allows for changes in speed, while preventing excessive
spikes due to errors in window centering. Once a speed is determined, the face is queued
for capture. When the face reaches the beginning of the queue, the pan-tilt mount is
moved to its next location (as predicted by the face's speed), and a high resolution image
is taken. This method is particularly robust to error because a window is only queued for
imaging if it occurs four times in a row. For something other than a face to be queued, it
would have to be falsely detected in four successive frames, which testing has shown is
extremely unlikely. A flow chart of the entire system operating is shown in Figure 13.
The entire process is run from one MATLAB graphical user interface. The current
version of the code encompasses over 50 files, containing approximately 5000 lines of
code.


Figure 13: Overall Outline of System Operation

8.  System  Testing 

Due to recent events involving human subject testing issues, testing on controlled
human subjects has ceased. This will resume in the Spring, with a shift to other objects
if necessary (due to human subject testing concerns). Regardless of what the system is
being tested on, final results will include measures of response times, capture rates, and
system errors.

Response  Times 

The system has two different response times. The first is a measure of how long it
takes the system to find an object of interest once it enters the scene. This is important
primarily in scenarios where objects are present in the scene for only a short amount of
time. The other measure of response time is how long it takes to capture a high resolution
image of an object, once it has been detected. This is one of the most important variables
to measure, since it determines how many objects can be in the scene before they begin to
get missed due to the system being too slow.

Capture  Rates 

This variable measures the time between successive captures with the high
resolution camera. This is measured by finding the time between captures when the
system is presented with many detections.
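
Given a log of capture times, the interval between successive captures could be computed as in the sketch below. The timestamps are invented for illustration and the function names are hypothetical; no measured results are implied.

```python
def capture_intervals(timestamps):
    """Time between each pair of successive captures, in the units of the input."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def mean_capture_interval(timestamps):
    """Average time between successive captures."""
    intervals = capture_intervals(timestamps)
    if not intervals:
        raise ValueError("need at least two captures")
    return sum(intervals) / len(intervals)


# Made-up capture times in seconds.
capture_times = [0.0, 1.5, 3.5, 5.0]
mean_gap = mean_capture_interval(capture_times)
```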

System Errors

Errors come in two main forms: capture errors and classification errors. Capture
errors occur when the system correctly detects an object, but then fails to take the high
resolution image of the object. This type of error would be primarily due to bad
predictions of future locations.

Classification errors come in two forms: false positives and false negatives. False
positives occur when the system classifies something as being an object of interest
when in fact it is not. False negatives occur when the system fails to correctly classify an
object of interest. False positives and false negatives are the most reliable indicators of
the performance of the neural network classifier.
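
Counting the two kinds of classification error from a set of hand-labeled windows can be sketched as follows. The function name and the example labels are hypothetical, and expressing each count as a rate over the true negatives or true positives is one common convention rather than something specified in the text.

```python
def classification_errors(truth, predicted):
    """Count false positives and false negatives.

    truth and predicted are equal-length lists of booleans,
    where True means "object of interest".
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    false_pos = sum(1 for t, p in zip(truth, predicted) if p and not t)
    false_neg = sum(1 for t, p in zip(truth, predicted) if t and not p)
    negatives = sum(1 for t in truth if not t)
    positives = sum(1 for t in truth if t)
    return {
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "false_positive_rate": false_pos / negatives if negatives else 0.0,
        "false_negative_rate": false_neg / positives if positives else 0.0,
    }


# Made-up labels for five windows.
truth = [True, True, False, False, True]
predicted = [True, False, True, False, True]
errors = classification_errors(truth, predicted)
```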

9. Conclusions

Biometric identification, public surveillance, and unmanned vehicle systems all
currently suffer from one common handicap: their use of low resolution video. This
severely restricts their ability to accurately achieve their respective goals. The conclusion
of this research will be a functioning system that addresses this problem by allowing the
accurate and efficient capture of high resolution images of objects. The ability to detect
any type of predefined object and the autonomous nature of the end-system will allow for
simple application to all three of these fields, and more.

10. Spring Semester Deliverables

Due to the human subject testing issues, the spring deliverables may change, but
currently the goal is a complete system that is able to be rapidly adjusted to varying
applications. In order for this to be accomplished, there are three main areas that must be
addressed. The first is the development of the background segmentation algorithm
discussed in Section 4. While the current method of motion detection works well enough
in controlled environments, a more robust algorithm is necessary for the system to be
useful in uncontrolled real-world environments. The second deliverable is the
development of an efficient way to adjust the system's classification algorithm to
different applications. Currently this process is extremely time intensive, and requires
large amounts of human interaction. The final version of the system will allow for
automation of most of this process. The final deliverable consists of all the testing
discussed in Section 8 and corresponding analysis. This will allow us to evaluate the
system's overall performance and viability in different real-world applications.




11. Works Cited

[1] "Human Research Protection Program", SECNAV Instruction 3900.39D, 2006
Nov 3. Available HTTP: http://www.onr.navy.mil/sci_tech/34/docs/
secnavinst_3900_39d.pdf

[2] P. Hallinan, G.G. Gordon, A.L. Yuille, P. Giblin and D. Mumford, Two and
Three Dimensional Patterns of the Face, Natick: A K Peters, 1999.

[3] W.Y. Zhao, "Image-Based Face Recognition: Methods and Issues", [Online
Document], cited 2006 Jan 14, Available HTTP: http://www.face-
rec.org/interesting-papers/General/Chapter_figure.pdf

[4] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, "Pfinder: Real-time
tracking of the human body," IEEE Transactions on Pattern Analysis and
Machine Intelligence 19, pp. 780-785, July 1997.

[5] N. Friedman and S. Russell, "Image segmentation in video sequences: A
probabilistic approach," Proceedings of the Thirteenth Annual Conference on
Uncertainty in Artificial Intelligence (UAI-97), pp. 175-181, Morgan
Kaufmann Publishers, Inc., (San Francisco, CA), 1997.

[6] Q. Zhou and J. Aggarwal, "Tracking and classifying moving objects from
videos," Proceedings of IEEE Workshop on Performance Evaluation of
Tracking and Surveillance, 2001.




12. Works Consulted

1. B.D. Ripley, Pattern Recognition and Neural Networks. Cambridge: Cambridge
University Press, 1996.

2. T. Masters, Signal and Image Processing with Neural Networks. New York: Wiley,
1994.

3. L. Li, W. Huang, I. Yu-Hua Gu, and Q. Tian, "Statistical Modeling of Complex
Backgrounds for Foreground Object Detection," IEEE Transactions on Image
Processing, vol. 13, no. 11, Nov., pp. 1459-1472.

4. R. Duda, P. Hart and D. Stork, Pattern Classification, 2nd Ed., New York: John Wiley
and Sons, 2001.


